Abstract


Cache  memory  is  commonly  used  to  bridge  the  gap  between  microprocessor  and  memory 
speeds.  A  wide  variety  of  cache  designs  are  possible,  so  some  method  is  required  to  evaluate  the 
benefits  and  costs  of  the  various  alternatives.  Trace  driven  simulation  is  commonly  used  by  the 
computer  architecture  community  to  analyze  potential  designs.  Traces  of  benchmark  execution  are 
applied  to  a  model  of  the  design  under  study.  Most  of  today’s  computer  systems  have  been  optimized 
based  on  results  of  these  studies. 

One  important  aspect  that  is  frequently  ignored  in  trace  driven  studies  is  the  effect  of  the 
operating  system  and  multiprogramming  on  cache  performance;  most  traces  consist  only  of  a  single 
program’s  execution.  It  has  been  acknowledged  in  the  past  that  this  overhead  introduces  interference 
which  limits  the  benefits  of  new  designs,  but  evaluations  using  multiprogrammed  traces  have  been 
neglected  due  to  the  lack  of  readily  available  tools  that  can  capture  such  traces. 

In  this  research  we  describe  a  new  tracing  system  that  allows  the  capture  of  both  operating 
system  and  multiprogrammed  execution  data.  Cache  performance  is  studied  using  multiprogrammed 
traces  of  the  SPEC  benchmarks.  We  study  the  effects  of  considering  multiple  tasks  on  the  cache  miss 
rate.  The  performance  variation  is  primarily  due  to  the  presence  of  context  switches.  In  an  attempt 
to  extend  this  work,  we  develop  an  analytical  model  that  is  used  to  synthetically  incorporate  context 
switches  into  a  single  process’  trace. 

We  have  found  that  the  operating  system  introduces  a  small  but  persistent  overhead  to 
cache  performance.  Additional  processes  have  an  even  greater  impact,  which  increases  as  the  level 
of  multi-tasking  increases.  Spatial  locality  is  not  significantly  affected  by  these  conditions,  but  the 
temporal  locality  of  a  program  is  substantially  reduced  by  the  presence  of  context  switches. 
1 Introduction


Improvements in processor technology are far outstripping the advances
made in memory circuit design. As processors execute faster and faster, the latency experienced
when  accessing  memory  becomes  a  major  limitation.  Faster  memory  is  available,  but  at  greater 
cost.  An  economical  balance  between  performance  and  price  is  achieved  through  the  use  of  memory 
caches. The main memory is implemented using less expensive but slower technologies such as DRAM,
making a large memory feasible. A much smaller memory cache is constructed of faster (and more
expensive) memory circuits, such as SRAM, to be used as a buffer between the main memory and
the  processor.  Sections  of  the  data  stored  in  main  memory  are  copied  into  the  cache,  allowing  it  to 
be  accessed  much  more  quickly.  Which  sections  of  memory  are  copied  into  the  cache,  and  how  the 
information is maintained, is a function of the design of the cache [22, 36, 52].

A  cache  is  effective  in  reducing  the  average  memory  access  time  because  of  certain  properties 
found  in  software.  The  collection  of  instruction  and  data  addresses  used  by  a  program  over  some 
time  interval  is  referred  to  as  its  working  set  [3]  or  footprint  [56].  The  working  set  may  change  as 
the  program  executes,  but  it  generally  exhibits  two  properties: 

1.  spatial  locality,  and 

2.  temporal  locality. 

Spatial  locality  refers  to  the  property  that  addresses  tend  to  cluster  together  in  space.  References  may 
be  sequential  or  in  some  other  way  structured,  denoting  a  high  degree  of  spatial  locality.  Similarly, 
temporal  locality  refers  to  the  property  that  addresses  tend  to  cluster  together  in  time.  Addresses 
in  the  working  set  may  be  used  repeatedly  during  their  lifetime,  denoting  a  high  degree  of  temporal 
locality. 

These  two  properties  allow  caches  to  improve  memory  system  performance.  A  memory 
reference  which  is  not  in  the  cache  causes  a  cache  miss.  The  data  at  the  referenced  location  and 
some  number  of  its  adjoining  locations  is  brought  into  the  cache.  Due  to  locality,  it  is  likely  that 
either  the  same  location  (temporal),  or  nearby  locations  (spatial),  will  be  referenced  in  the  near 
future. When these references occur, the data is already present in the cache and a cache hit ensues.
On a hit, the data can be very rapidly supplied to the processor, much faster than an access to the
main memory. The improvement provided by a cache becomes a function of how often a hit occurs
and how fast the addressed data can be provided to the processor, balanced by the delay introduced
when servicing a cache miss.
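
The interaction between locality and hits can be illustrated with a minimal sketch of a direct mapped cache. The function name, the cache geometry (64 lines of 16 bytes), and the two reference patterns are assumptions chosen for illustration; they are not taken from the experiments in this work.

```python
def simulate_direct_mapped(addresses, num_lines=64, block_size=16):
    """Return (hits, misses) for a direct mapped cache of num_lines blocks."""
    tags = [None] * num_lines          # block number currently held by each line
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size     # block containing this address
        line = block % num_lines       # cache line the block maps to
        if tags[line] == block:
            hits += 1
        else:
            misses += 1
            tags[line] = block         # fetch the block, replacing the old one
    return hits, misses

# Spatial locality: one sequential pass over 1024 consecutive byte addresses.
sequential = list(range(1024))
# Temporal locality: the same 8 addresses, one per block, reused 128 times.
repeated = [i * 16 for i in range(8)] * 128

seq_hits, seq_misses = simulate_direct_mapped(sequential)
rep_hits, rep_misses = simulate_direct_mapped(repeated)
print(seq_hits, seq_misses, rep_hits, rep_misses)
```

In the sequential pattern every miss fetches a block whose remaining bytes are then hit, while in the repeated pattern only the first touch of each block misses.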


The  critical  nature  of  caches  has  led  to  extensive  study  of  various  designs,  configurations, 
and  enhancements,  all  oriented  towards  increasing  cache  performance.  There  are  diverse  methods 
available  to  assess  the  alternatives,  ranging  from  prototyping  to  simulation.  Regardless  of  the 
method,  the  accuracy  of  the  evaluation  is  paramount.  The  criteria  used  to  justify  any  evaluation  must 
accurately  reflect  the  environment  to  which  the  cache  will  be  subjected,  otherwise  any  conclusions 
are  questionable. 

One of the major shortcomings of the most common evaluation methods is that the effect
of the operating system and of multiple executing user processes is neglected. The methods are
simpler, but ignore a major aspect of the computer’s architecture. Several past efforts have shown
that the related impact is significant enough to warrant inspection [1, 2, 8, 11, 12, 41], and including it
certainly gives a more realistic representation of the execution environment. The drawback is the difficulty of
incorporating  these  considerations  into  the  evaluation.  There  is  generally  some  overhead  required, 
in  time  and/or  resources,  to  perform  such  complex  tests. 

The  research  described  here  focused  on  developing  a  tool  to  capture  multiprocess  state 
information  and  perform  subsequent  evaluations,  exploring  its  capabilities  with  studies  in  both 
detailed  cache  simulations  and  testing  an  analytical  model.  This  thesis  is  organized  as  follows.  In 
section  2  cache  performance  and  evaluation  methods  are  reviewed.  Section  3  describes  the  analysis 
tool  ATOM,  and  how  it  can  be  used  specifically  on  the  operating  system  and  in  a  multi-process 
environment.  Section  4  discusses  the  methodology  followed  in  this  research  and  outlines  the  tests 
performed.  Section  5  reviews  the  results  of  simulations  performed  in  the  multi-process  environment. 
In  section  6  an  analytical  model  is  presented  that  can  be  used  to  simplify  simulations  with  minimal 
loss  of  accuracy,  which  is  tested  in  section  7.  Section  8  concludes  the  work,  with  a  summary  of  its 
contributions in section 9. Last are section 10, the acknowledgments, and section 11, the bibliography.
Two appendices are attached: A, copies of the programs used in this research, and B, tables of all
simulation results.
2 Background

2.1  Cache  Performance 

Cache  performance  encompasses  a  variety  of  issues.  At  the  most  basic  level,  the  performance 
of  a  cache  can  be  defined  by  its  miss  rate  (or  ratio),  the  percentage  of  references  applied  to  the  cache 
whose  data  was  not  already  present  in  the  cache.  Alternatively  the  hit  rate,  which  is  the  percentage 
already  present,  may  be  referred  to.  The  two  values  represent  equivalent  information,  since  the 
miss  rate  equals  one  minus  the  hit  rate  and  vice  versa.  Depending  on  the  system  and  evaluation 
performed,  however,  this  metric  may  be  an  oversimplification.  The  goal  of  the  cache  is  to  improve 
the  average  memory  access  time,  which  is  a  function  of  more  than  just  the  miss  rate.  It  is  entirely 
possible for a cache to have a low miss rate but, due to other considerations, a long access time,
thus limiting its usefulness. Hence many evaluations are based not on miss rates, but rather on
the cache latency [7, 8, 41, 47]. The drawback is that an evaluation of that scope
is much more difficult and requires modeling a greater portion of the system under test, so
evaluations frequently focus simply on miss rates anyway.
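
The point that a low miss rate alone does not guarantee a short access time can be made concrete with the usual textbook expression for average memory access time: hit time plus miss rate times miss penalty. The sketch below assumes that formulation, which is not stated explicitly in this work; the two cache configurations and their cycle counts are hypothetical.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical caches: A misses half as often as B but is slower on every hit.
cache_a = average_access_time(hit_time=3, miss_rate=0.02, miss_penalty=50)
cache_b = average_access_time(hit_time=1, miss_rate=0.04, miss_penalty=50)
print(f"A: {cache_a:.2f} cycles, B: {cache_b:.2f} cycles")
```

Under these assumed numbers cache A averages about 4 cycles per access and cache B about 3, so the cache with the higher miss rate is the faster of the two.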

Regardless  of  the  standard  used,  the  cache  miss  rate  is  important,  as  the  average  access 
time  does  depend  on  this  value.  To  understand  the  significance  of  the  miss  rate,  it  is  important  to 
understand  the  various  sources  of  misses.  A  program  generates  a  stream  of  memory  references  as  it 
executes,  which  are  applied  to  the  cache.  Cache  misses  are  caused  when  an  address  in  the  reference 
stream  is  not  present  in  the  cache.  This  can  occur  for  basically  three  reasons  [3,  55]: 

Start Up The first form of miss occurs the first time that a particular address is referenced in
the  stream.  Since  it  has  not  been  referenced  before,  there  is  no  expectation  that  that  memory 
location  would  have  been  copied  into  the  cache.  Such  misses  are  encountered  primarily  when 
a  program  begins  executing  and  all  references  are  new,  also  called  the  warm  up  phase  of  the 
cache.  The  size  of  the  cache  and  the  program  both  contribute  to  the  length  of  this  phase. 
As  the  working  set  changes,  additional  start  up  misses  are  encountered  as  new  locations  are 
referenced. 

Though  a  certain  address  may  not  have  been  previously  referenced,  it  is  still  possible  that  its 
data  is  already  in  the  cache.  When  data  is  copied  from  memory  to  the  cache,  it  is  moved  in 
quantities  called  blocks.  A  block  is  usually  larger  than  a  single  memory  access,  so  a  single  miss 
fetches  more  data  than  is  required  for  a  single  access.  If  a  location  is  referenced  that  resides  in 
a block already fetched, it will hit, even though that particular address may be new. This is
only  effective  for  memory  references  that  are  primarily  sequential,  such  as  instruction  fetches, 
in  which  case  a  large  block  size  is  beneficial.  Footprints  with  less  locality,  such  as  data  loads 
and  stores,  can  actually  have  the  reverse  effect  as  large  blocks  bring  in  excess  data  which  is 
never  used. 

Another  technique  to  prevent  start  up  misses  is  the  use  of  prefetching  [14,  15,  52].  This  is 
essentially  an  attempt  to  predict  what  locations  will  be  referenced  in  the  near  future,  and  fetch 
them  into  the  cache  before  they  are  requested.  The  method  of  prediction  can  be  hardware  or 
software  based,  and  must  be  accurate  for  prefetching  to  be  effective.  If  data  is  falsely  predicted 
and  fetched  into  the  cache,  it  may  overwrite  “live”  data  (live  meaning  that  it  is  still  part  of  the 
current working set), causing cache pollution. Additional enhancements such as a prefetch
buffer filter or victim cache can be used to limit this impact [22]. Using prefetching can improve
miss rates; however, it also increases the traffic between the cache and memory. An accurate
evaluation  cannot  consider  only  miss  rates  with  this  technique,  otherwise  its  drawbacks  will 
be  obscured. 

Capacity  The  second  form  of  miss  is  due  to  the  finite  cache  size.  A  large  program  cannot  possibly 
fit  its  entire  working  set  into  a  small  cache.  As  various  parts  of  the  working  set  are  used,  they 
will  overwrite  other  live  data.  The  obvious  solution  is  to  use  a  larger  cache,  but  at  additional 
expense.  Another  potential  solution  is  to  analyze  the  locations  used  in  the  working  set.  The 
references  may  cluster  around  certain  blocks  while  others  are  unused.  Changing  the  mapping 
of  addresses  to  cache  lines  (or  indices)  may  allow  the  references  to  be  better  distributed  across 
all cache lines [7]. This technique is also an effective counter for the next type of miss, which
together with capacity misses is sometimes referred to as intrinsic interference.

Conflict  The  third  form  of  miss  is  due  to  conflict  between  two  references.  If  two  addresses  in  the 
working  set  map  to  the  same  cache  line,  each  time  they  are  referenced  a  cache  miss  may  result 
(depending  on  the  actual  pattern  of  references).  Again,  altering  the  mapping  algorithm  may 
reduce  the  amount  of  conflict  in  a  given  reference  stream  by  spreading  out  clumps.  Another 
option  is  to  use  an  associative  cache  [22,  52].  In  this  form  of  cache,  each  cache  line  (sometimes 
called  set)  can  maintain  multiple  blocks,  so  multiple  locations  can  map  to  the  same  line  without 
conflict.  The  number  of  blocks  held  in  each  line  is  referred  to  as  the  set  size  or  associativity 
of that cache, and can vary from 1 to the maximum possible given the available chip area.
This  type  of  cache  can  be  pictured  as  a  two  dimensional  array  of  blocks,  with  the  vertical 
dimension  the  number  of  lines  and  the  horizontal  the  associativity.  The  bounding  cases  are 
a  direct  mapped  cache  with  an  associativity  of  one,  and  a  fully  associative  cache  with  only 
one  line.  The  drawback  is  that  for  a  finite  cache  area,  increasing  the  associativity  decreases 
the  number  of  cache  lines,  so  each  line  in  the  cache  has  more  locations  mapped  to  it  and  a 
corresponding  heavier  load.  Also,  associative  caches  are  frequently  slower,  which  should  be  a 
factor  in  comprehensive  evaluations. 

These  three  categories  comprise  the  basic  types  of  misses  found  in  a  process’  reference  stream.  They 
must  be  considered  in  even  a  minimal  performance  measurement,  although  there  are  other  cache 
components  that  may  improve  memory  system  performance  without  affecting  the  miss  rate. 
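
The three categories can be separated in simulation. The sketch below assumes a common textbook convention rather than a procedure taken from this work: a miss is a start up miss if the block was never referenced before, a capacity miss if a fully associative LRU cache of the same total size would also have missed, and a conflict miss otherwise. The function name, the LRU replacement policy, and the example traces are assumptions for illustration.

```python
from collections import OrderedDict

def classify_misses(addresses, num_sets, ways, block_size):
    """Count hits and start up, capacity, and conflict misses (LRU replacement)."""
    sets = [OrderedDict() for _ in range(num_sets)]  # cache under study
    full = OrderedDict()          # fully associative reference cache, same size
    total_blocks = num_sets * ways
    seen = set()                  # blocks referenced at least once
    counts = {"hit": 0, "start_up": 0, "capacity": 0, "conflict": 0}
    for addr in addresses:
        block = addr // block_size
        # Update the fully associative reference cache first.
        full_hit = block in full
        if full_hit:
            full.move_to_end(block)
        else:
            if len(full) >= total_blocks:
                full.popitem(last=False)       # evict least recently used
            full[block] = True
        # Then the set associative cache being evaluated.
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)
            counts["hit"] += 1
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)
            lines[block] = True
            if block not in seen:
                counts["start_up"] += 1
            elif full_hit:
                counts["conflict"] += 1        # only the mapping caused the miss
            else:
                counts["capacity"] += 1        # too many live blocks for the cache
        seen.add(block)
    return counts

# Direct mapped, 4 lines of 16 bytes: addresses 0 and 64 collide on line 0.
ping_pong = classify_misses([0, 64] * 4, num_sets=4, ways=1, block_size=16)
# Same capacity organized as 2 sets of 2 blocks: the collision disappears.
two_way = classify_misses([0, 64] * 4, num_sets=2, ways=2, block_size=16)
# Cyclic scan of 8 blocks through a 4 block cache: the working set does not fit.
scan = classify_misses([i * 16 for i in range(8)] * 2,
                       num_sets=4, ways=1, block_size=16)
print(ping_pong, two_way, scan)
```

The first two runs use the same trace and the same total capacity, so the difference between them isolates the effect of associativity on conflict misses.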

Other  cache  enhancements  which  do  not  directly  affect  miss  rates  are  usually  related  to 
access  times.  Techniques  such  as  using  a  Translation  Lookaside  Buffer  (TLB)  [49]  can  perform 
cache  lookups  and  virtual  address  conversions  in  parallel.  Other  methods  include  using  hierarchies 
of  caches,  such  as  a  small  direct  mapped  cache  on  chip  and  a  second  level  larger  cache,  possibly 
associative,  off  chip.  Using  combinations  of  caches  can  potentially  improve  the  performance  more 
than  a  single  highly  complex  cache  [52].  In  some  instances  an  entire  cache  is  not  added,  but  various 
buffers  or  filters  are  accommodated,  such  as  the  prefetch  buffer  or  victim  cache  [7]. 

The  cache  performance  will  depend  on  many  characteristics  of  the  cache.  Some  of  the  most 
basic are its size and structure, and the method it uses to resolve both hits and misses for each
reference type (instruction fetch, data read, and data write). Performance enhancing mechanisms may
also  be  included,  each  addressing  various  deficiencies.  Studies  have  shown  that  multiple  mechanisms 
in  concert  are  generally  the  most  effective  [47].  The  wide  variety  of  cache  designs  makes  the  ability 
to  evaluate  various  options  paramount,  and  there  are  concerns  that  have  yet  to  be  addressed  which 
further  complicate  analysis. 

So  far  in  this  discussion,  caches  have  been  considered  in  an  idealized  environment.  Modern 
computers do not simply execute a single program continuously until its completion. The operating
system generates its own references as system calls are requested. The operating system also
generates  references  for  processes  such  as  interrupt  services  and  other  management  tasks,  which  are 
performed  periodically.  Even  more  complex  is  a  multiprocess  environment,  with  multiple  programs 
or  threads  being  executed.  In  a  multitasking  system  there  are  several  processes  or  tasks  all  vying 
for system resources, one of which is memory. In a uniprocessor system, control is accomplished by
time  sharing.  The  various  tasks  are  executed  for  finite  intervals  and  then  execution  is  switched  to 
another  process  —  called  a  context  switch.  As  each  task  is  scheduled  and  executed,  it  generates 
its  own  reference  stream  with  unique  characteristics.  The  individual  streams  are  interleaved  by  the 
context  switches  to  yield  an  aggregate  reference  stream  which  impinges  on  the  cache  [19,  31,  56]. 

This  introduces  a  new  mechanism  causing  a  fourth  and  final  type  of  miss,  transient  cache 
misses.  When  a  process  is  swapped  out  during  a  context  switch,  the  process  or  processes  that  execute 
until  the  original  process  is  returned  will  overwrite  its  cache  data.  This  data  may  still  have  been 
live,  so  the  overwrites  may  cause  additional  cache  misses  once  the  original  process  is  restored.  This 
is  referred  to  as  extrinsic  interference  [2],  as  opposed  to  the  intrinsic  interference  discussed  above, 
and  can  be  thought  of  as  a  reload  period  after  each  context  switch  as  evicted  data  is  returned  to 
the  cache  [56].  The  impact  of  extrinsic  interference  will  magnify  with  increased  multiprogramming 
as  the  duration  of  each  swap  is  extended,  although  this  can  be  partially  negated  by  stabilizing  the 
time  quantum  that  each  process  executes. 
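
The reload effect after each context switch can be reproduced with a small sketch that interleaves per-process reference streams in round-robin fashion and applies the aggregate stream to a shared direct mapped cache. The fixed quantum measured in references, the two workloads, and the assumption that the processes occupy disjoint address ranges are all hypothetical simplifications; the traces studied in this work are captured from real execution.

```python
from collections import defaultdict

def interleave(traces, quantum):
    """Round-robin over the traces, switching every `quantum` references.

    Yields (pid, address) pairs, where pid is the index of the trace.
    """
    positions = [0] * len(traces)
    while any(pos < len(t) for pos, t in zip(positions, traces)):
        for pid, trace in enumerate(traces):
            chunk = trace[positions[pid]:positions[pid] + quantum]
            positions[pid] += len(chunk)
            for addr in chunk:
                yield pid, addr

def misses_by_process(stream, num_lines=64, block_size=16):
    """Apply the stream to one shared direct mapped cache; return {pid: misses}."""
    tags = [None] * num_lines
    misses = defaultdict(int)
    for pid, addr in stream:
        block = addr // block_size
        line = block % num_lines
        if tags[line] != block:
            misses[pid] += 1
            tags[line] = block
    return dict(misses)

# Hypothetical workloads: A loops over 32 blocks, B over 64 blocks that lie in
# a disjoint address range but map onto the same cache lines.
trace_a = [i * 16 for i in range(32)] * 20
trace_b = [0x10000 + i * 16 for i in range(64)] * 10

alone = misses_by_process(interleave([trace_a], quantum=64))
shared = misses_by_process(interleave([trace_a, trace_b], quantum=64))
print(alone, shared)
```

Run alone, process A misses only on its first pass. When it shares the cache with B, its blocks are overwritten during every quantum it is switched out, so it pays the reload cost again each time it resumes.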

Some  designs  call  for  the  cache  to  be  totally  flushed  (invalidated)  at  each  context  switch 
automatically.  This  might  be  appropriate  for  a  control  mechanism  such  as  the  cache  type  structure 
used  to  implement  a  TLB,  but  in  an  instruction  or  data  cache  it  is  quite  likely  that  some  of  the  live 
data  from  a  process  would  still  be  resident  when  that  process  returns  to  execution.  By  maintaining 
the cache data for as long as possible, the extrinsic interference is kept to a minimum, although this
does require additional overhead to monitor the owner of each line of cache data, and complicates
analysis  [22]. 

Other architecture issues can further complicate performance considerations. A multiprocessor
system is similar to what has already been discussed, but more complicated. Not only are
multiple  reference  streams  being  generated,  they  are  generated  simultaneously  and  possibly  applied 
to  multiple  caches.  Each  processor  may  maintain  its  own  memory  structure  or  they  may  share 
a  common  structure.  This  raises  the  issue  of  cache  coherency,  or  the  property  that  data  stored 
in memory is properly maintained in each location it is represented. If multiple processors share
memory  but  have  their  own  caches,  care  must  be  taken  to  monitor  when  data  is  in  multiple  caches 
(shared)  so  that  if  the  data  is  modified,  it  is  modified  in  all  caches.  Various  policies  can  be  used 
when  data  is  stored  to  the  cache,  such  as  write  through,  meaning  data  is  written  to  memory  as 
soon  as  it  is  written  to  cache,  or  write  back,  meaning  the  data  is  not  written  to  memory  until  it 
is evicted from the cache. Each has various advantages and disadvantages, and in turn affects the
policy  used  to  maintain  coherence  [15,  29].  There  are  a  variety  of  other  technical  issues  as  well, 
such  as  communication  and  synchronization,  making  this  a  very  complex  design.  Even  more  radical 
departures  from  the  traditional  von  Neumann  architecture,  to  a  dataflow  architecture  for  example, 
cause  even  greater  difficulties  in  defining  evaluation  criteria  [30]. 

2.2  Cache  Analysis 

2.2.1  Methods 

There  are  a  variety  of  methods  available  to  evaluate  cache  performance.  General  reviews 
are  presented  in  [1,  11,  13,  60].  The  techniques  can  be  broken  down  into  various  categories: 

Analytical  Models  The  most  abstract  form  of  analysis  is  based  on  a  theoretical  prediction  derived 
from  the  test  system’s  characteristics  and  assumptions  of  how  it  is  loaded.  Developing  a  model  of 
the  system  under  test  requires  certain  assumptions  which  may  oversimplify  aspects  of  cache  design, 
neglect  relevant  characteristics  of  the  input,  or  may  not  be  sufficiently  verified  to  warrant  their  use. 
The  accuracy  of  the  evaluation  is  limited  by  the  accuracy  of  the  theoretical  model,  and  unfortunately, 
the  more  accurate  and  comprehensive  the  model,  the  more  difficult  it  is  to  solve  [3].  Some  models 
are  based  on  abstract  parameters  with  little  relation  to  the  actual  system  [31],  and  others  may 
require considerable test program characterization, to the point that other methods would be equally
suitable  [56].  The  most  successful  models  tend  to  focus  on  very  limited  aspects  of  memory  system 
performance  to  reduce  their  scope  [28,  55]. 

Hardware  Evaluation  The  antithesis  of  theoretical  analysis  is  hardware  evaluation.  In  this 
method,  the  test  system  is  implemented  and  inserted  into  some  platform.  Its  performance  can 
then be monitored directly as the platform is operated. The actual analysis is quite quick, as
the processing is conducted at the same speed as the platform; however, the test system must be
constructed,  which  may  be  a  slow  and  expensive  process.  The  other  disadvantage  is  that  to  test 
a  variety  of  alternative  designs,  each  alternative  must  be  constructed.  This  limits  the  flexibility 
and can be even more costly. Rapid prototyping can make this method more attractive, and some
examples can be found in [11, 24]. Using techniques of hardware emulation can also be more
efficient,  although  they  are  slower  [40]. 
Trace Based Simulation By far the most common form of analysis is trace driven simulation.
A  trace  of  program  references  is  generated  and  applied  to  a  model  of  the  system  being  tested.  The 
model  is  simulated  in  software,  and  can  be  as  complex  as  accuracy  dictates.  A  software  model  is 
very  flexible,  but  simulations  are  slower  to  compute.  Also,  the  traces  must  somehow  be  stored, 
which  requires  a  great  deal  of  memory,  although  they  can  be  reused.  The  trace  can  be  as  complex 
as  desired,  and  there  are  a  variety  of  methods  that  can  be  used  to  generate  it: 

Synthetic  Generation  Workloads  can  be  created  for  system  test  through  the  use  of  synthetic 
generators. No programs need be executed; reference streams are simply generated randomly.
Some control is provided through defining random variables and their distributions, establishing
the desired characteristics of the workload. Since it is artificially generated, however, its
accuracy  is  highly  suspect.  Various  examples  of  this  technique  can  be  found  in  [35,  46,  57,  58]. 

System  Emulation  Another  alternative  which  does  not  require  program  execution  uses  system 
emulation.  A  test  program  is  required,  but  it  is  fed  into  an  instruction  set  simulator  which 
generates  reference  stream  data.  This  pseudo  execution  of  programs  is  very  slow,  though,  and 
is  rarely  used  [60]. 

Hardware Capture The last two methods monitor the execution of a test program on some platform,
capturing the reference stream as the program executes. In hardware capture, the
platform  is  modified  so  that  as  it  executes  the  test  code,  the  references  generated  are  collected 
and  stored.  It  is  easy  to  capture  a  wide  variety  of  references  in  the  trace  working  at  this  level, 
but  this  technique  suffers  from  the  disadvantage  of  requiring  unique  hardware  and/or  costly 
modification.  The  two  most  common  forms  of  hardware  capture  have  been  accomplished  by 
modifying  the  microcode  of  the  CPU  [1,  2],  or  by  using  test  probes  inserted  into  the  system 
to electrically read the system status [11, 60]. The first can only be used with certain architectures,
however, and the latter is limited by the external visibility of data (for instance, an
on  chip  cache  could  not  be  monitored).  Once  each  reference  is  captured,  there  are  a  variety 
of  ways  to  record  it,  such  as  storing  it  in  a  buffer  and  occasionally  writing  the  buffer  to  a  file. 
The  method  must  be  able  to  record  data  as  fast  as  the  system  generates  it,  which  may  be 
a  significant  limitation.  Despite  the  disadvantages,  this  method  is  frequently  used  in  certain 
situations  where  other  methods  may  not  be  feasible,  such  as  very  complex  architectures  [5,  59]. 
Software Capture The most common form of trace generation is by software capture. Instead of
modifying  the  testbed,  the  software  can  be  altered  so  that  information  about  the  program’s 
execution  is  recorded.  Again,  the  trace  is  generally  stored  in  a  buffer  until  it  can  be  written  out 
to  a  file,  although  there  are  alternatives.  Software  capture  is  more  flexible  than  hardware  based 
methods,  as  the  information  that  is  collected  can  be  easily  updated  as  evaluation  needs  change, 
but capturing all aspects of the reference stream (such as the operating system) can be difficult.
Capture  can  be  based  on  snooping  programs  [50],  interrupt  generation  [32],  or  by  explicitly 
modifying  the  test  code.  This  modification  can  occur  during  compilation  [7,  8,  25,  43,  45]  or 
can  be  applied  to  an  existing  executable  [11,  12,  13,  54]. 
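
Of the trace generation methods listed above, synthetic generation is the simplest to sketch. The generator below is a hypothetical illustration and does not reproduce any of the cited generators: each reference is sequential with probability `p_sequential`, a reuse of a recent address with probability `p_reuse`, and a random jump otherwise. All parameter names and default values are assumptions.

```python
import random

def synthetic_trace(n, p_sequential=0.6, p_reuse=0.3, address_space=1 << 20,
                    window=32, stride=4, seed=0):
    """Generate n word-aligned addresses with tunable locality (n >= 1)."""
    rng = random.Random(seed)                  # seeded so the trace is repeatable
    trace = [rng.randrange(0, address_space, stride)]
    while len(trace) < n:
        r = rng.random()
        if r < p_sequential:
            # Spatial locality: step to the next word.
            addr = (trace[-1] + stride) % address_space
        elif r < p_sequential + p_reuse:
            # Temporal locality: revisit one of the last `window` addresses.
            addr = rng.choice(trace[-window:])
        else:
            # Jump to an unrelated location.
            addr = rng.randrange(0, address_space, stride)
        trace.append(addr)
    return trace

trace = synthetic_trace(1000, seed=1)
```

The two probabilities play the role of the random variable distributions mentioned above. Nothing in such a generator guarantees that the result resembles a real program, which is the accuracy concern raised for this method.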

Extensions  There  are  also  various  extensions  that  can  be  used  with  the  above  techniques  to 
improve  their  efficiency.  For  instance,  one  major  drawback  of  trace  based  simulation  is  the  storage 
space  required  for  the  traces.  To  compensate,  it  is  possible  to  have  the  analysis  program  executing 
concurrently  with  the  trace  generation,  so  that  no  long  term  storage  is  required;  one  example  is 
[8].  This  does  preclude  reuse,  however.  Other  techniques  include  sampling  traces  to  reduce  their 
length,  although  this  may  affect  their  accuracy  depending  on  what  assumptions  are  made  in  the 
sampling  process  [1,  2,  6,  33,  61].  It  is  also  possible  to  simply  compress  the  trace  file,  but  this 
is  only  a  short  term  solution.  Other  extensions  include  using  various  processing  algorithms  such 
as  stack  based  processing  to  simplify  simulation  [48,  64],  or  reducing  processing  time  with  parallel 
computation  [42,  43,  63].  Analytical  models  can  be  used  in  conjunction  with  program  traces  to 
simplify  simulation  and  provide  evaluation  over  a  variety  of  system  characteristics  with  a  single 
execution  [3]. 
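
Stack based processing can be sketched with LRU stack distances: a single pass over the trace records, for each reference, how deep the block sits in a stack ordered by recency, and the miss count of a fully associative LRU cache of any size then follows from those depths. This is a well known technique offered here as an illustration; whether it matches the exact algorithms of the cited works is an assumption, and the function names are hypothetical.

```python
def stack_distances(blocks):
    """Return the LRU stack distance of each reference (None on first use)."""
    stack = []                        # most recently used block at index 0
    distances = []
    for block in blocks:
        if block in stack:
            depth = stack.index(block)
            distances.append(depth + 1)   # distance 1 means the previous block
            stack.pop(depth)
        else:
            distances.append(None)        # never seen before
        stack.insert(0, block)
    return distances

def misses_for_size(distances, size):
    """Misses of a fully associative LRU cache that holds `size` blocks."""
    return sum(1 for d in distances if d is None or d > size)

distances = stack_distances([0, 1, 2, 0, 1, 2, 3, 0])
miss_curve = {size: misses_for_size(distances, size) for size in (2, 3, 4)}
print(distances, miss_curve)
```

The list based stack keeps the sketch short but costs time proportional to the stack depth per reference; a practical implementation would use a more efficient structure.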

2.2.2  Issues 

The  evaluation  method  used  must  accurately  reflect  the  type  of  workload  that  would  be 
present  in  a  real  system.  This  is  particularly  a  concern  when  analytical  models  are  used,  as  programs 
may  not  be  executed  at  all,  so  a  statistical  approach  is  common  [57,  58].  For  hardware  measurement 
and  trace  based  simulation,  this  problem  is  addressed  by  selecting  appropriate  programs  to  be 
executed  in  the  evaluation.  Specific  programs  known  as  benchmarks  are  used  as  accepted  standards 
for  testing  [34,  45,  49].  There  are  differences  in  workloads  depending  on  the  type  of  programs  being 
considered,  whether  they  are  technical  or  commercial  applications  [37],  so  generally  multiple  test 
programs are used to ensure the evaluation is comprehensive. The better test programs will have
a large and complex footprint to exercise the cache fully, although this can make standardization
more difficult and analysis slower.

Once  a  workload  is  identified,  how  it  is  represented  and  used  in  the  analysis  can  vary.  If 
a  program  is  executed  or  traced,  there  are  a  variety  of  concerns  that  must  be  addressed  for  the 
evaluation to inspire much confidence [1, 11, 13, 60]:

Reference Scope The simplest forms of references to monitor are from a single process [7, 25,
45, 61, 62]; although they are easy to capture, they are not a particularly realistic
reflection of cache loading. Even in this basic form, care must be taken to ensure that shared
libraries  and  other  common  structures  are  captured.  A  more  realistic  reference  stream  includes 
additional  processes,  and  if  possible,  the  operating  system.  Hardware  evaluation  of  a  cache  and 
hardware  based  trace  capture  for  simulation  do  allow  capture  of  all  references,  but  as  mentioned 
before  they  have  other  drawbacks.  It  may  be  difficult  to  identify  the  source  of  particular 
references,  too,  making  analysis  more  difficult.  Through  the  use  of  comprehensive  software 
capture  mechanisms,  it  is  possible  to  capture  traces  with  multiple  processes  [8,  41].  In  its  most 
complex  form,  this  mechanism  can  also  be  used  to  capture  traces  that  include  the  operating 
system  [1,  2],  however  a  thorough  understanding  of  the  test  system  is  necessary  for  proper 
implementation.  Such  references  are  more  difficult  to  capture,  and  present  a  new  problem 
in processing. The multiprocess environment is non-deterministic: the reference stream can
vary  even  for  execution  of  the  same  test  programs  as  scheduling  and  interrupts  change  the 
execution  pattern.  For  a  truly  accurate  comparison,  all  tests  must  be  performed  from  a  single 
stored  trace,  or  they  must  all  be  performed  concurrently  from  the  stream  as  it  is  generated 
and  processed  [8]. 

Reference Length Another accuracy problem with reference streams is their length. As caches
increase  in  size,  more  references  are  required  to  fully  exercise  them.  A  large  cache  can  contain 
a  large  footprint,  so  a  long  program  is  needed  to  generate  such  a  footprint.  This  is  particularly 
relevant  for  RISC  machines,  which  will  have  significantly  longer  traces  for  a  given  program 
because of the increased number of instructions. Current practice calls for traces on the order of 100
million to 10 billion references [8]. Hardware evaluation places no constraint
on program execution, but trace-based methods may be limited. Early tracing mechanisms
could not generate long enough traces, so shorter traces were stitched together [1, 2]. In other
cases,  single  process  traces  were  interleaved  to  approximate  a  multiprocess  environment  [56]. 
Recently,  more  robust  methods  have  become  available  so  that  such  artificial  measures  are  not 
required  [13,  20].  Long  traces  are  difficult  to  manage  because  of  the  storage  space  they  require. 
Analysis  can  be  conducted  on  the  fly  so  the  traces  are  used  as  they  are  generated  [8],  or  the 
traces  can  be  sampled  to  reduce  their  length  [3]. 

Platform  Impact  The  operating  system  and  compiler  used  affect  cache  performance.  The  relative 
location  of  a  program’s  instructions  and  data  will  affect  the  amount  of  conflict  since  those 
locations  determine  which  cache  line  each  will  be  mapped  to.  Other  considerations  such  as 
data  alignment,  prefetch/flush  commands,  and  program  scheduling  will  also  affect  the  reference 
stream. The compiler generates code optimized for a certain physical memory system, so
the code may not be ideal for the test memory systems being considered. For the purposes of most
evaluations,  this  effect  is  considered  to  be  equivalent  across  all  designs,  and  can  be  ignored, 
particularly  by  using  the  least  optimized  code  possible  [69]. 

The  memory  system  used  on  the  platform  will  also  affect  the  evaluations  performed  with  it. 
The size of the memory can produce page faults and other activities, which in turn generate
additional overhead references that would not have occurred in the modeled system. Other
systems  may  dynamically  schedule  activities  based  on  the  system  state,  which  may  include 
memory  system  performance,  so  ordering  of  events  may  be  subtly  altered. 

In  certain  architectures,  the  scheduling  of  references  is  linked  directly  to  the  memory  system 
performance.  For  instance,  one  possible  method  to  hide  the  cache  latency  is  to  generate  a 
context  switch  on  any  cache  miss.  For  this  to  be  viable,  the  overhead  of  performing  a  context 
switch  must  be  less  than  the  latency  to  service  a  cache  miss.  If  this  is  the  case,  the  cache 
performance then plays a major role in defining the reference stream. One solution, used
in [38], is to simulate not only the cache but the pipeline and instruction set as well. The
test program executable file is fed into the simulation, which executes it “virtually”. Such a
simulation  is  very  comprehensive  but  also  quite  complex.  Parallel  systems  present  a  similar 
problem.  References  may  be  generated  for  one  system  and  a  variety  of  memory  configurations 
can  be  tested,  but  any  changes  to  the  architecture  of  the  underlying  system  may  totally 
invalidate the accuracy of the reference stream. Also, multiple reference streams are being
generated simultaneously, either being applied to the same cache or multiple caches that must
remain  consistent.  Generally,  such  complex  architectures  dictate  certain  types  of  evaluation 
methods,  using  either  synthetic  [46]  or  hardware  monitored  traces  [59]  for  analysis.  Another 
option  is  to  capture  robust  traces  with  more  information  than  just  simple  addresses  so  that 
the  execution  stream  can  be  re-created  for  a  variety  of  systems  [26,  32]. 

Reference  Mapping  When  a  reference  is  applied  to  the  cache,  it  is  mapped  onto  a  cache  line. 
A  simple  hashing  of  the  address  bits  may  be  used,  or  a  more  complex  algorithm,  possibly 
including other information such as the process identifier [52]. The algorithm varies with
the system, and it may matter depending on how addresses are collected. Depending on the
capture method, the addresses generated may also be virtual or physical. Virtual addresses
may be used to model caches; however, this is a simplification. The actual memory system
must  at  some  point  convert  all  addresses  to  physical  form.  This  conversion  affects  how  lines 
are  mapped  from  memory  to  the  cache,  so  it  is  relevant  to  cache  performance.  Unfortunately, 
converting  to  physical  addresses  is  a  very  complex  task  that  requires  considerably  more  system 
state  information  than  is  provided  by  a  basic  reference  trace.  Since  the  placement  of  programs 
in  memory  affects  their  mapping  into  the  cache,  the  loading  of  programs  into  memory  is  also 
relevant,  although  this  is  usually  controlled  by  the  operating  system. 
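
To make the mapping concern concrete, the following Python sketch computes the cache line index for an address under simple modulo indexing, optionally folding in a process identifier, and shows how virtual-to-physical translation can change the index. The function names, the XOR hash, and the page table contents are illustrative assumptions; they are not taken from any of the cited systems.

```python
def cache_index(address, line_size, num_lines, pid=None):
    """Return the cache line index for an address (simple modulo indexing)."""
    block = address // line_size
    if pid is not None:
        # Hypothetical hash: fold the process identifier into the block number.
        block ^= pid
    return block % num_lines


def translate(virtual_address, page_table, page_size):
    """Convert a virtual address to a physical one using a toy page table."""
    page, offset = divmod(virtual_address, page_size)
    return page_table[page] * page_size + offset


# Assumed configuration: 32-byte lines, 1024 lines (32KB cache), 8KB pages.
virtual = 0x4000
physical = translate(virtual, {2: 5}, 8192)   # virtual page 2 -> physical frame 5
virtual_index = cache_index(virtual, 32, 1024)
physical_index = cache_index(physical, 32, 1024)
```

With these assumed values the same reference falls on line 512 when indexed virtually and on line 256 once translated, so the two address forms can produce different conflict patterns whenever the cache is larger than a page.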

There  are  additional  concerns  relevant  to  particular  methods.  If  traces  are  captured,  care 
must  be  taken  so  that  the  act  of  tracing  does  not  affect  the  trace  generated.  Hardware  capture 
methods  tend  to  be  non-intrusive,  but  have  other  drawbacks.  Software  based  methods  in  particular 
are  very  intrusive  since  they  modify  the  test  programs,  and  certain  measures  must  be  taken  to 
compensate  [1,  11,  13,  60]: 

Address Skewing The code added to a test program will change the addresses used for both
instruction  fetches  and  data  accesses.  If  the  addresses  during  execution  are  used  directly  for 
the  analysis,  the  results  will  be  skewed.  Instead,  the  addresses  must  be  calculated  based  on 
what  the  reference  position  would  have  been  without  tracing.  This  is  normally  handled  by  the 
trace  generation  software,  and  can  be  transparent  to  the  simulation  model. 

Processing  Skewing  The  additional  code  inserted  into  a  program  can  also  cause  the  processing 
characteristics  of  the  test  program  to  be  skewed.  The  added  code  may  make  additional  calls 
to system resources or generate additional interrupts. The capture mechanism should ideally
identify the source of references so they can be discarded if not generated by the original test
program,  although  this  is  difficult  when  the  operating  system  is  considered. 

Program  Size  Since  program  size  is  increased,  certain  aspects  of  execution  will  be  changed  such 
as  paging.  The  larger  programs  will  occupy  more  memory  and  hence  require  greater  system 
overhead  to  manage. 

Program Speed The program speed is related to the program’s size. The additional code introduced
into programs can easily slow down their execution by an order of magnitude [8]. The
more processing introduced by tracing, the greater the slowdown will be. This affects the
accuracy of traces in two ways. Longer programs will have a disproportionate number of real-time
interrupts during their execution. Some form of scaling must be used so the frequency of
this type of interrupt is reduced within the trace. Neglecting to perform the service routine
is possible, but this may affect system performance. The longer programs will also have a
disproportionate  number  of  context  switches  as  the  additional  code  can  both  cause  switches 
as  well  as  slow  down  the  original  program  so  that  less  is  accomplished  during  the  maximum 
execution  interval  allowed  by  the  scheduler. 
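
The interrupt scaling mentioned under Program Speed can be sketched as follows. This is only an illustration under stated assumptions: the event stream is a list of labels, `"timer"` marks a real-time interrupt, and the slowdown factor is a known integer. It is not the scaling mechanism of any particular tracing tool.

```python
def scale_timer_interrupts(events, slowdown):
    """Keep one timer interrupt out of every `slowdown`; pass other events through."""
    kept = []
    timer_count = 0
    for event in events:
        if event == "timer":
            timer_count += 1
            if timer_count % slowdown != 0:
                # Drop this interrupt: it occurred only because tracing
                # stretched the program's running time.
                continue
        kept.append(event)
    return kept


trace = ["ref", "timer", "ref", "timer", "timer", "ref", "timer"]
scaled = scale_timer_interrupts(trace, slowdown=2)
```

With a slowdown factor of 2, every second timer interrupt survives, so the interrupt rate per original reference is roughly restored.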

Once  such  concerns  are  addressed  for  a  given  evaluation  methodology,  an  analysis  can  be  performed 
with  a  great  deal  of  confidence  in  its  results. 

2.3  Current  Work 

As  early  as  the  late  1980’s,  the  impact  of  the  operating  system  and  additional  processes  was 
recognized  as  a  concern  in  memory  system  performance  [1,  2,  3].  More  recent  work  has  consistently 
validated  the  supposition  that  this  impact  was  significant  enough  to  warrant  further  study,  and 
should  be  included  in  any  comprehensive  memory  system  evaluation  [5,  11,  12,  13,  41,  59].  More 
importantly, as computing capability has increased, it has become possible to capture longer and more
complete traces directly, without resorting to the patchwork measures described earlier.

Much  of  the  recent  work  has  revolved  around  trace  driven  simulation  with  software  capture 
methods.  Many  studies  still  consider  cache  performance,  although  others  are  becoming  more  focused, 
looking  at  specific  areas  such  as  the  effect  different  operating  system  structures  can  have  on  memory 
system  performance  [11,  12].  Some  of  the  methods  used  are  either  proprietary  [37],  or  especially 
designed for a certain application [62]. Some generic tools have been developed, such as Epoxie,
which rewrites assembly code to generate address traces [11, 12, 13].

Another such tool is ATOM, which is very similar to those found in [11, 12, 13, 37]. Developed by
DEC’s Western Research Laboratory, ATOM is a general purpose program analysis tool that can be
customized  to  perform  a  wide  variety  of  different  evaluations.  Until  recently,  ATOM  focused  on  only 
the  single  process  environment,  but  in  its  latest  versions,  it  now  has  the  capability  to  capture  traces 
that  include  the  operating  system  as  well  as  multiple  user  programs.  This  research  has  revolved 
around refining this capability and demonstrating its applicability to cache analysis.

3 ATOM Overview

3.1  General  Use 


ATOM  (Analysis  Tools  with  OM)  [51]  is  not  a  specific  application;  rather  it  is  a  toolset  that 
can  be  used  to  produce  custom  analysis  tools.  It  provides  the  framework  to  generate  program  traces 
during  execution  and  pass  the  trace  data  to  analysis  routines  through  a  procedure  call  interface. 
The  analysis  or  simulation  program  is  actually  incorporated  into  the  test  program,  so  as  the  test 
program is executed, so is the tool. This procedure is commonly referred to as execution driven
simulation, effectively combining the act of tracing and analysis. Tracing of this type alleviates the
need  for  trace  storage,  as  well  as  the  difficulties  of  synchronizing  a  separate  analysis  program  with 
the  test  programs. 

The  analysis  performed  can  vary  a  great  deal  due  to  the  flexibility  provided  by  ATOM. 
Tracing  is  performed  on  selected  events  such  as  program  start/stop,  basic  block  boundaries,  memory 
reads and writes, instructions, or procedures. Certain instances of a given event can be selected (e.g.,
a certain procedure call), or all instances of an event (e.g., every instruction). The trace capture
is  inserted  as  a  function  call  to  an  analysis  routine,  so  that  when  a  particular  event  occurs  during 
execution,  information  about  that  event  is  passed  to  the  analysis  routine  where  the  event  data  is 
recorded,  processed,  or  in  some  other  way  used  to  perform  the  desired  evaluation. 

Given  this  type  of  framework,  tools  are  quite  easy  to  generate.  For  a  simple  cache  simulator 
with  a  single  process,  the  test  program  is  instrumented  at  every  instruction  fetch  and  at  every  data 
load  or  store.  The  memory  location  referenced  by  each  instruction  is  passed  to  the  analysis  routines 
corresponding  to  that  reference  type.  Within  the  analysis  routine,  the  cache  simulation  is  performed, 
so  that  when  the  test  program  concludes,  the  simulation  is  completed. 
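
The structure of such a simulator can be sketched as below. The sketch is written in Python for brevity rather than in the C of the ATOM analysis files, and the class name, the direct-mapped organization, the default sizes, and the allocate-on-store policy are all assumptions made for illustration.

```python
class CacheSimulator:
    """Direct-mapped cache model driven by per-reference analysis routines."""

    def __init__(self, num_lines=256, line_size=32):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # block currently held by each line
        # Per reference type: [accesses, misses]
        self.counts = {kind: [0, 0] for kind in ("ifetch", "load", "store")}

    def _reference(self, kind, address):
        block = address // self.line_size
        index = block % self.num_lines
        self.counts[kind][0] += 1
        if self.tags[index] != block:
            self.counts[kind][1] += 1   # miss: bring the block into the line
            self.tags[index] = block

    # Analysis routines, one per instrumented event type.
    def instruction_fetch(self, address):
        self._reference("ifetch", address)

    def load(self, address):
        self._reference("load", address)

    def store(self, address):
        self._reference("store", address)  # assumed to allocate on a miss

    def report(self):
        """Called at program end: return {kind: (accesses, misses)}."""
        return {kind: tuple(pair) for kind, pair in self.counts.items()}
```

Each instrumented event in the test program would call the matching routine with the referenced address, and `report` would run at the program end event.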

The  specific  form  of  analysis  to  be  “instrumented”  into  the  test  program  is  incorporated  at 
link  time  by  ATOM  using  two  files: 

1.  the  instrumentation  file,  which  instructs  ATOM  which  events  to  trace  on  and  what  event 
information  to  pass  to  the  analysis  routines,  and 

2.  the  analysis  file,  which  defines  the  various  analysis  routines  and  any  other  subsidiary  functions 
required. 

It is a very simple process to use. The test program is compiled, and then used as input to
the ATOM program with the following example command line:

% atom program.rr inst.c anal.c -o program.trace

The program is then executed and the desired analysis specified by inst.c and anal.c is performed.
This is a very simple example. ATOM accepts various control flags, which are
described in both the on-line documentation and the program manuals.

For  simplicity  it  is  also  possible  to  define  tools  for  ATOM.  A  tool  description  file  is  created 
which  specifies  which  instrumentation  and  analysis  files  to  use,  as  well  as  the  various  flags  to  pass 
to  ATOM.  The  programs  are  instrumented  with  a  tool  by  using  the  command  line: 

% atom program.rr -tool eval -o program.trace

In  addition  to  simplifying  the  command  line,  defining  a  custom  tool  also  allows  additional  control 
flags to be used. The basic ATOM command line does not accept loader flags, for example, so the
flags necessary to include shared libraries such as the math library (-lm) cannot be used. This would normally
prevent analysis routines from accessing such basic functions, which is obviously an inconvenience.
By defining a tool, it is also possible to define additional flags and the stage of instrumentation at which
they should be used, allowing the use of shared libraries and other linker/loader flags.

With  the  flexibility  provided,  ATOM  is  a  versatile  tool,  but  accuracy  is  still  a  potential 
problem.  Another  strong  point  for  ATOM  is  its  robustness.  In  the  cache  example  above,  one  major 
concern  is  the  fact  that  by  adding  additional  code  to  the  program,  the  reference  stream  becomes 
skewed  by  the  additional  instructions.  This  is  automatically  compensated  for  by  ATOM  during 
instrumentation,  so  that  the  addresses  passed  to  the  analysis  routines  are  those  of  the  memory 
references  without  tracing. 
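
The effect of this compensation can be pictured as a lookup from instrumented addresses back to the addresses the program would have used without tracing. The sketch below is one hypothetical way to build such a table, assuming the number of bytes inserted before each original instruction is known; it does not describe ATOM's internal mechanism.

```python
def build_address_map(original_addresses, inserted_bytes_before):
    """Map each instrumented instruction address back to its original address.

    original_addresses    -- instruction addresses in the uninstrumented program
    inserted_bytes_before -- bytes of tracing code inserted before each one
    """
    mapping = {}
    shift = 0
    for address, extra in zip(original_addresses, inserted_bytes_before):
        shift += extra                      # tracing code pushes later code down
        mapping[address + shift] = address
    return mapping


# Three 4-byte instructions; 8 bytes inserted before the first, 16 before the third.
address_map = build_address_map([0x1000, 0x1004, 0x1008], [8, 0, 16])
```

A simulator fed through such a table sees the reference stream of the untraced program, which is the property the paragraph above attributes to ATOM.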

Another  area  ATOM  excels  in  is  its  care  with  shared  libraries.  Many  simulations  totally 
neglect  shared  libraries,  which  may  be  a  significant  portion  of  the  code  depending  on  the  application. 
Programs  can  be  compiled  with  the  non_shared  option,  or  ATOM  can  instrument  the  shared  libraries 
as  well.  To  be  even  more  exact,  an  instrumented  and  non-instrumented  copy  of  the  shared  library 
routines  are  produced.  This  way  if  the  instrumented  program  calls  a  shared  library,  the  instrumented 
version  of  the  library  is  used.  If  the  analysis  routine  calls  the  same  library  function,  the  non- 
instrumented  version  is  used  so  that  the  analysis  is  not  corrupted. 

Until recently, ATOM was not capable of tracing the operating system, and was not particularly
suitable for tracing multiple test programs. The latest version of ATOM, however, does allow
instrumentation  of  the  operating  system.  The  initial  tests  of  this  facility  were  performed  by  Eustace 
and  Chen  in  [20],  but  some  aspects  were  not  particularly  well  addressed.  The  primary  focus  of  this 
research  has  been  to  further  test  and  build  on  their  work  [24]. 

3.2  Operating  System  Implementation 

With  the  latest  version  of  ATOM,  it  is  now  possible  to  instrument  and  study  the  operating 
system,  specifically  the  OSF  kernel.  It  is  treated  much  as  any  program  would  be,  albeit  a  very  large 
and  complex  one.  Because  of  the  unique  nature  of  the  operating  system,  there  are  certain  measures 
which  must  be  taken  that  are  not  required  for  a  normal  program.  Part  of  the  mechanism  used  to 
study  the  kernel  is  also  used  to  capture  traces  with  multiple  user  processes  as  well. 

3.2.1  Set  Up 

To  use  ATOM  with  the  operating  system,  some  modifications  are  usually  required  to  the 
test platform. More memory may be needed to execute the larger programs; 128MB is recommended
by  DEC.  The  larger  programs  will  also  require  more  swap  space  (256MB  recommended),  a  larger 
user  file  space,  and  an  expanded  root  partition  (up  to  60MB  depending  on  the  application).  ATOM 
version  2.20  or  later  must  be  installed,  with  the  WRL  enhancement  kit.  Both  are  available  from 
DEC  via  anonymous  FTP. 

Changes  are  necessary  to  allow  the  kernel  to  be  instrumented.  The  makefile,  normally  in 
the  /usr/sys  directory,  must  be  modified  and  the  kernel  remade.  The  two  modifications  required 
are: 

1.  The  LDFLAG  line  must  have  the  -ncr  flag  removed.  This  flag  removes  the  compact  relocation 
records,  and  is  not  compatible  with  ATOM. 

2. The ALPHA_TEXTBASE must be increased to account for the larger kernel size. This value
represents  the  amount  of  space  in  memory  allocated  for  the  kernel  text,  usually  set  at  h230000. 
Instrumentation  increases  the  size  of  the  kernel  so  this  value  must  be  increased  accordingly. 
The  required  increase  will  vary,  so  occasionally  the  kernel  must  be  generated  twice.  First  a 
rough  estimate  of  the  necessary  increase  is  used  to  make  a  kernel  which  is  instrumented.  The 
nm -B command can then be used to calculate the actual value needed. If it is too small, the
kernel will crash, and if it is too large, memory may be wasted. For the work performed here,
a  value  of  h2C00000  was  used. 


Once  the  makefile  has  been  altered,  a  new  kernel  is  created  by  the  sequence  of  commands: 

#make clean
#make  depend 
#make 

These  commands  must  be  executed  as  root;  using  the  sudo  utility  is  not  possible  as  the  kernel  will 
not  be  made  correctly.  During  testing  it  was  useful  to  have  multiple  kernels  available  with  different 
ALPHA_TEXTBASE  values  as  needs  changed.  If  multiple  kernels  are  made,  it  is  necessary  to  rename 
the existing kernels before a new one is created as all existing files of the form vmunix*.* are erased
during the make process. The new kernels can then be instrumented like any other program.

3.2.2  Programming 

The  act  of  instrumentation  inserts  function  calls  into  the  test  program.  These  functions  are 
executed  as  each  event  is  reached  during  program  execution,  performing  the  desired  analysis.  For  a 
cache  simulator,  those  events  are  instruction  fetches,  data  reads,  and  data  writes.  At  each  memory 
reference,  the  address  referenced  is  passed  to  the  analysis  function  for  processing  in  the  cache  model. 
Additional  functions  are  used  at  program  start  and  end  to  initialize  the  simulation  parameters  and 
report the simulation’s results. The various functions and the instrumentation are defined in the two
ATOM  files  mentioned  previously  for  both  the  kernel  and  test  programs. 

To  incorporate  the  operating  system  into  the  analysis,  it  is  necessary  for  the  operating  system 
and  test  program  to  share  data.  The  cache  state  must  be  accessible  to  both  programs,  as  well  as 
other  counters  and  synchronization  flags.  This  sharing  can  be  accomplished  via  the  /dev/kmem  or 
/dev/mmap  utilities.  The  shared  data  is  local  to  the  kernel.  When  the  test  program  begins,  either 
of  the  utilities  is  used  to  map  the  shared  data  into  the  test  program’s  address  space,  where  it  can 
be  accessed  via  a  pointer.  Now  the  two  processes  have  a  common  data  structure  that  is  the  core  of 
the  simulation.  To  use  these  utilities,  there  are  two  requirements.  First,  the  test  programs  must  be 
run  as  root  to  access  the  /dev/  files.  Second,  two  copies  of  the  kernel  must  be  created.  One  is  the 
executable  which  is  actually  loaded,  the  other  is  a  debug  version  which  contains  the  symbol  table 
information necessary to perform the mapping. The debug version stays in the same directory as
the test programs.

The ability to share data is also the key to capturing traces from multiple processes. As
described  above,  data  is  captured  from  two  processes,  the  kernel  and  the  user  program.  As  will 
be  seen,  the  same  technique  can  be  used  to  increase  the  number  of  processes  being  captured.  The 
example  above  uses  shared  cache  state  data,  but  any  set  of  data  may  be  shared  to  provide  the  desired 
capture  information. 

The  instrumentation  and  analysis  files  are  not  substantially  different  for  the  kernel  and  user 
programs.  For  the  kernel,  a  test  must  be  used  to  ensure  that  certain  procedures  are  not  instrumented 
(see  below).  For  the  test  program,  the  shared  data  must  be  mapped  at  program  start  and  the  data 
recorded  at  program  end.  Otherwise,  the  analysis  functions  may  be  more  or  less  the  same.  For  the 
cache  simulator,  a  process  identification  value  is  passed  with  the  address  so  that  the  sending  process 
is  recognizable. 
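
The role of the shared cache state and the process identifier can be sketched as follows. The class, the use of identifier 0 for the kernel, and the tagging of each line with a (process, block) pair are assumptions for illustration; the actual sharing is done through the mapped kernel data described above, not a Python object.

```python
from collections import defaultdict


class SharedCacheState:
    """One cache model updated by the analysis routines of several processes."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.accesses = defaultdict(int)   # per-process access counts
        self.misses = defaultdict(int)     # per-process miss counts

    def reference(self, pid, address):
        block = address // self.line_size
        index = block % self.num_lines
        self.accesses[pid] += 1
        # Assumption: addresses are virtual, so a line belongs to one process.
        if self.tags[index] != (pid, block):
            self.misses[pid] += 1
            self.tags[index] = (pid, block)


KERNEL, USER = 0, 1                        # hypothetical process identifiers
shared = SharedCacheState(num_lines=4, line_size=16)
shared.reference(USER, 0)                  # cold miss
shared.reference(USER, 0)                  # hit
shared.reference(KERNEL, 0)                # kernel evicts the user's line
shared.reference(USER, 0)                  # miss caused by the interference
```

Because every process updates the same state, the misses charged to the user program include those caused by kernel activity, and adding further processes only requires further identifiers.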

Figure  1  shows  logically  how  the  original  code  and  analysis  routines  work  together  to  perform 
the  desired  analysis,  in  this  case  the  cache  simulator. 


Figure 1: Program Block Diagram

3.2.3 Execution


Once  the  required  files  are  written,  the  implementation  is  not  substantially  different  from 
that  of  any  other  test  program.  The  two  instrumented  versions  of  the  kernel  are  produced  with  two 
slightly  different  command  lines.  For  the  executable: 

% atom vmunix kern.inst.c kern.anal.c -Xkernel -Xgprog -o vmunix.trace

and for the debug version:

% atom vmunix kern.inst.c kern.anal.c -Xkernel -g -o vmunix.debug

The  various  test  programs  are  also  instrumented  as  described  above.  The  executable  version 
of the kernel is moved to the root directory, and the system is halted with the #shutdown -h now command.
Using boot -fl i, the system is restarted and the instrumented kernel is specified and loaded. The
testbed is frequently shut down, so it was helpful to have a dedicated system for this research so that
other  work  was  not  interrupted.  Once  the  kernel  is  running  at  the  desired  execution  level,  the  test 
programs  are  then  executed  normally,  performing  the  analysis.  It  is  recommended  that  a  batch  file 
be  used  to  run  test  programs  to  simplify  testing. 

3.3  Problem  Areas 

3.3.1  ATOM  Limitations 

Certain  characteristics  of  ATOM  define  limitations  on  the  instrumentation  which  can  be 
used  within  the  Unix  kernel. 

•  Since  it  is  the  operating  system,  tracing  cannot  be  based  on  the  program  end  event. 

•  Certain  kernel  procedures  cannot  be  instrumented.  These  are  the  locore,  lockprim,  and 
spl  libraries,  which  account  for  only  132  out  of  10,678  kernel  procedures  so  the  error  induced 
should  be  negligible. 

•  Floating  point  numbers  cannot  be  used  within  the  kernel. 

•  The  ATOM  model  used  when  simulating  dynamic  memory  allocation  is  not  accurate  within 
the  kernel,  so  analysis  of  this  aspect  of  program  execution  is  suspect. 

• No system call interfaces can be used within the kernel.

Most of these limitations are not particularly significant, although the last is inconvenient. Without
system calls, file I/O is not possible, which precludes using a file to set evaluation parameters. This
makes  it  very  difficult  to  dynamically  define  analysis  parameters,  so  in  many  cases  the  programs  and 
operating  system  must  be  re-instrumented  for  each  desired  evaluation  (i.e.  a  separate  run  for  each 
cache  configuration).  Many  other  shared  library  routines,  such  as  mathematical  functions,  are  also 
unavailable.  As  future  versions  of  ATOM  are  released,  hopefully  some  of  these  shortcomings  will  be 
addressed. 

3.3.2  Kernel  Limitations 

Working  with  the  kernel  also  entails  certain  problems,  especially  for  a  programmer  unfamiliar 
with  the  operating  system  environment.  The  kernel  is  difficult  to  manipulate,  requiring  special  access 
privileges.  The  critical  nature  of  the  program  requires  careful  handling,  although  based  on  previous 
work,  instrumentation  errors  will  not  damage  the  system  —  a  kernel  improperly  instrumented  will 
usually not even boot. The primary obstacle in working with an operating system is
debugging. Most debugging tools cannot be used to debug a kernel, and many of the error
messages  generated  are  cryptic.  Initial  testing  of  instrumentation  code  should  be  done  on  generic 
user  programs,  and  only  when  working  on  that  level  should  it  be  attempted  on  the  kernel.  This 
provides  better  checking,  and  a  much  faster  debug  and  test  cycle.  Working  with  the  kernel  is  a  slow 
process.  Making  a  new  kernel  takes  up  to  8  minutes,  and  each  instrumentation  can  take  as  much, 
if not more, time. Even assuming a new kernel is not required, to test a kernel usually takes about
20-30  minutes  (as  compared  to  the  almost  instantaneous  results  from  a  simple  user  program).  Even 
with  debugging  on  a  user  program,  many  problems  will  only  appear  in  the  kernel,  so  in  general, 
development  is  very  slow.  Some  of  this  may  have  been  due  to  system  limitations,  but  only  a  minor 
improvement  should  be  expected  with  better  resources. 

There  were  three  obscure  errors  found  regularly  during  kernel  testing: 

1.  KSP  INVAL 

2.  bootstrap  address  collision:  image  loading  aborted 

3.  trap:  invalid  memory  access  from  kernel  mode 

The  first  error  can  occur  when  the  kernel  is  loaded  or  during  execution.  This  is  roughly  equivalent 
to a segmentation violation which is normally caused by a misuse of pointers. This error may
also be caused by running out of memory, if there is not enough stack or heap for the kernel to
execute.  The  second  message  always  appears  during  kernel  loading.  This  is  caused  by  an  incorrect 
ALPHA_TEXTBASE  assigned  in  the  makefile.  The  nm  -B  command  should  be  used  to  determine  the 
correct  value  and  the  kernel  remade.  The  final  error  always  occurs  during  test  program  execution. 
This  was  an  intermittent  error  and  the  cause  was  never  found,  even  after  conferring  with  DEC. 
The error always occurred in the kernel’s thread_preempt routine which suggests it is related to
interrupts  and/or  context  switching.  The  error  was  linked  to  the  size  of  the  test  programs  being 
executed.  A  single  large  program  could  cause  the  error  (such  as  Xlisp),  or  combinations  of  smaller 
programs  (such  as  Alvinn  with  any  other  program,  or  Compress,  GCC,  and  Espresso  all  together). 
Since  it  occurred  with  only  one  test  program  running,  it  cannot  be  caused  by  having  two  or  more 
test programs sharing the kernel’s data structure. The memory of the testbed was increased from
64  to  160MB  with  no  effect.  The  hardclock  scaling  (see  below)  was  reduced  to  its  minimum  value  of 
50%  with  no  effect.  To  isolate  the  problem  it  will  be  necessary  to  complete  an  examination  of  the 
kernel  which  is  beyond  the  scope  of  this  work.  The  most  likely  cause  is  the  threaded  execution  of 
the  kernel  and  the  lack  of  firm  control  within  the  analysis  routines;  although  it  is  possible  that  the 
hardclock  scaling  is  the  culprit. 

3.3.3  Program  Size 

One  common  problem  with  any  software-based  tracing  method  is  the  increase  in  program 
size. Since the program is instrumented with not only tracing information, but also analysis functions,
this is a greater concern when ATOM is used. The normal OSF kernel is about 8-9MB. If
the same kernel is instrumented with a function call at every instruction, and an additional call
at every data read or write, the kernel will grow to 92.7MB and require an ALPHA_TEXTBASE of
about h5A00000. A kernel this size could not even be loaded on the test machine. By instrumenting
groups of instructions (and still each data reference), the kernel is only about 46MB with an
ALPHA_TEXTBASE of h2C00000, which is executable. Instrumenting just instruction or data accesses
will  reduce  the  size  by  about  half.  It  is  important  to  note  that  the  size  of  the  instrumented  kernel  is 
primarily  a  function  of  the  degree  of  instrumentation,  not  analysis.  Changing  the  amount  of  analysis 
processing  only  varied  the  size  of  the  kernel  by  about  4MB. 
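
As a rough check on these figures, the ALPHA_TEXTBASE value can be compared with the kernel size. The sketch assumes that the leading h denotes hexadecimal and that the value can be read as a byte count; both are interpretations of the notation used here, and the helper name is hypothetical.

```python
def textbase_bytes(value):
    """Interpret a value such as 'h2C00000' as a hexadecimal byte count."""
    return int(value.lstrip("h"), 16)


size = textbase_bytes("h2C00000")
size_millions = size / 1_000_000   # size in millions of bytes
```

Under these assumptions h2C00000 is 46,137,344 bytes, about 46 million, which agrees with the 46MB kernel produced by instrumenting groups of instructions.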

Besides  the  strain  on  the  system  from  working  with  such  a  large  kernel,  it  also  raises  an 
accuracy issue. The kernel used in our tests left only 15MB of memory available for test programs,
yet this is supposed to be simulating a system with about 50MB of free memory. The situation is
even worse considering that each test program is also instrumented and significantly larger than
normal. Such large programs require more paging, which in turn skews the amount
of  overhead  each  program  requires.  For  more  accurate  results,  the  amount  of  memory  should  be 
increased  proportionately. 

3.3.4  Execution  Speed 

Execution  speed  becomes  critical  when  considering  the  instrumented  kernel.  The  inclusion 
of tracing can reduce the execution speed of a program by an order of magnitude [8], more so with
the  additional  processing.  A  slowdown  of  this  magnitude  may  not  be  tolerated  by  the  operating 
system.  At  some  point,  the  kernel  becomes  so  slow  that  it  cannot  function  correctly.  Interrupts  and 
service requests may be generated faster than they can be serviced, effectively hanging the system
during  boot  up.  This  can  also  be  seen  during  test  program  execution  if  too  many  processes  are 
executed  —  the  kernel  simply  thrashes  and  the  system  stalls.  Even  assuming  the  operating  system 
does  work,  basic  tasks  can  take  an  inordinate  amount  of  time.  Booting  a  kernel  with  a  basic  cache 
simulator  in  multi-user  mode  and  logging  on  took  over  an  hour  in  one  test.  Several  methods  have 
been  explored  to  accelerate  the  kernel  and  counter  this  problem. 

The  first  is  to  use  a  different  programming  style  for  the  kernel  analysis  routines.  Only 
the  bare  minimum  code  necessary  to  perform  the  desired  task  is  used.  No  additional  function 
calls  are  made  beyond  the  initial  call  to  the  analysis  routine,  eliminating  extra  switching.  Any 
additional  computation  is  incorporated  into  the  primary  function,  even  if  this  requires  duplicating 
code. Loops should be used sparingly and the iterations minimized, and any other time consuming
operations  should  be  optimized.  Minimizing  data  storage  may  help,  but  is  not  a  primary  factor. 
These  techniques  will  definitely  speed  execution,  particularly  eliminating  function  calls,  so  even 
though  some  of  these  changes  introduce  poor  programming  practice  from  a  software  engineering 
standpoint,  they  need  to  be  used. 

If the kernel boots, but is too slow to execute the test programs in a multi-user environment,
the first solution is to reduce the number of additional processes the kernel may be executing.
Programs  being  run  by  other  users  or  not  part  of  the  test  should  be  eliminated.  Other  background 
processes associated with the operating system can also be killed. In multi-user mode, there are additional
background processes executing, such as LAT, cron, network software, and printer daemons.
Many of these are not necessary for the tests and can be removed; the fewer processes running,
the faster the kernel will be.

If the kernel is still too slow, or will not boot in multi-user mode, it is possible to run the
programs  in  single  user  mode.  This  effectively  eliminates  all  extraneous  processes  and  dedicates 
the system to the instrumented test programs. When the system boots to the first # prompt, do
not start the higher execution level (the command is ^D). The local disks can be mounted using
# mount -at ufs so that the test programs can be accessed (assuming they are on a local disk). The
simulations  can  then  be  executed  normally.  If  multiple  test  programs  are  desired,  they  can  be  run 
concurrently  by  using  background  mode  (&)  for  each.  Using  single  user  mode  is  significantly  faster, 
and  can  be  considered  an  advantage  or  disadvantage.  It  is  true  that  most  of  the  processes  that 
would be executing in a “real” environment are absent, lessening the accuracy; however, it also lets
the  analysis  focus  on  the  operating  system  overhead  associated  with  a  particular  program  without 
all  the  other  extraneous  references.  The  use  of  single  user  mode  will  depend  on  both  the  constraints 
of  the  kernel  and  the  desired  evaluation.  Single  user  mode  may  also  limit  the  choice  of  test  programs. 
Some  programs,  such  as  SC  in  the  SPEC  benchmark  suite,  require  specific  interfaces  which  may  not 
be  available  and  so  cannot  be  executed. 

If  the  kernel  is  so  slow  that  it  cannot  even  be  booted,  it  may  be  necessary  to  disregard  some 
of  the  real-time  interrupts  that  are  stalling  the  system.  The  main  interrupt  of  concern  is  the  system 
call  to  the  hardclock.  The  number  of  the  hardclock  calls  which  are  performed  can  be  scaled  by  using 
assembly  code  [10].  This  allows  a  certain  percentage  of  the  interrupts  to  be  ignored.  This  has  by 
far  the  most  significant  impact  on  kernel  speed,  and  should  be  sufficient  to  allow  most  programs  to 
execute. 
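The scaling itself is performed in assembly within the kernel [10] and is not reproduced here. As a minimal sketch of the idea only, the following Python model passes a fixed percentage of clock interrupts to the real handler and drops the rest. The class name ClockScaler and the parameter keep_percent are hypothetical and are not taken from the original implementation.

```python
class ClockScaler:
    """Model of interrupt scaling: service only keep_percent% of clock ticks."""

    def __init__(self, keep_percent):
        if not 0 < keep_percent <= 100:
            raise ValueError("keep_percent must be in (0, 100]")
        self.keep_percent = keep_percent
        self.credit = 0      # accumulated share of a serviced tick, in percent
        self.serviced = 0    # ticks passed on to the real handler
        self.ignored = 0     # ticks dropped

    def tick(self):
        """Called on every hardware clock interrupt; True if it is serviced."""
        self.credit += self.keep_percent
        if self.credit >= 100:
            self.credit -= 100
            self.serviced += 1
            return True
        self.ignored += 1
        return False
```

With keep_percent set to 10, which corresponds to a 90% reduction, every tenth tick reaches the handler. Spreading the serviced ticks evenly, rather than dropping them in bursts, is a design choice of this sketch.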

The  speed  factor  also  raises  a  question  of  accuracy.  Any  event  that  is  based  on  an  absolute 
timing  mechanism  (such  as  real  time  interrupts)  will  not  be  affected  by  instrumentation.  That 
means  that  as  an  instrumented  program  executes,  it  sees  a  disproportionate  number  of  these  events 
during  its  execution.  The  hardclock  scaling  mentioned  above  will  partially  resolve  this  issue,  but  it 
has  not  been  fully  verified.  Another  accuracy  factor  is  the  number  of  context  switches.  If  a  system 
uses  a  maximum  execution  interval,  the  frequency  of  context  switches  seen  by  an  instrumented 
test  program  will  also  be  out  of  proportion.  One  measure  used  in  [8]  is  to  increase  the  maximum 
execution  interval  defined  by  the  task  scheduler. 
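The size of the distortion can be estimated with simple arithmetic. The sketch below is illustrative only: the clock rate, instruction rates, and slowdown factor used in the usage note are hypothetical values, not measurements from this work.

```python
def interrupts_per_million_instructions(clock_hz, instr_per_sec):
    """Clock interrupts that arrive while one million instructions execute."""
    return clock_hz * 1_000_000 / instr_per_sec


def keep_fraction(slowdown):
    """Fraction of interrupts to service so that the number seen per
    instruction matches the uninstrumented program."""
    return 1.0 / slowdown


def scaled_quantum(native_quantum_ms, slowdown):
    """Wall-clock time slice that lets the slowed program execute as many
    instructions per slice as the native program would."""
    return native_quantum_ms * slowdown
```

For an assumed 1000 Hz clock, a program running at 100 million instructions per second sees 10 interrupts per million instructions. If instrumentation slows it by a factor of 10, it sees 100. Servicing one interrupt in ten, or lengthening the execution interval tenfold, restores the original proportion under these assumptions.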

3.3.5  Re-entrance


One  of  the  most  complex,  and  possibly  significant,  aspects  of  working  with  the  kernel  is  its 
multi-threaded  nature.  System  calls,  interrupt  service  routines,  and  other  overhead  functions  are 
all  separate  processes  to  be  executed  by  the  processor.  They  may  be  executed  at  any  time  during 
program  or  analysis  execution.  This  causes  a  problem  of  guaranteeing  the  integrity  of  the  analysis 
data.  For  example,  during  execution  of  the  test  program,  the  analysis  routine  is  called.  While 
the  analysis  routine  is  still  processing  that  particular  event,  an  interrupt  occurs.  The  interrupt 
will  supersede  the  analysis  routine  and  the  interrupt  service  routine  will  be  executed.  The  service 
routine  is  part  of  the  kernel,  and  is  also  instrumented.  Therefore,  as  the  service  routine  executes, 
it  also  generates  events  and  calls  to  the  analysis  routines,  before  the  prior  analysis  routine  call  has 
completed.  Since  all  analysis  routines  access  a  common  data  structure,  the  actual  state  of  the  data 
becomes  non-determinate  and  the  evaluation  results  inaccurate.  Consider  an  analysis  routine  which 
is  interrupted  in  the  middle  of  incrementing  a  counter.  The  counter  is  loaded  and  incremented,  but 
has  yet  to  be  stored.  The  second  execution  of  the  analysis  routine  also  increments  the  counter,  so 
it  loads,  increments,  and  stores  the  data.  The  problem  is,  the  value  the  second  routine  loaded  was 
incorrect,  since  the  first  routine  never  had  a  chance  to  store  the  new  value  of  the  counter.  When  the 
first  routine  does  return  to  execution,  it  then  writes  the  value  of  the  counter,  which  eliminates  any 
changes  to  the  counter  that  occurred  during  the  interruption.  Analysis  functions  must  be  designed 
explicitly to handle such concerns. Such functions are called re-entrant, since they can effectively be “entered” multiple
times  without  loss  of  integrity. 
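The counter example can be replayed deterministically. The sketch below is a model only, not kernel code: it breaks the increment into its load, increment, and store steps and executes the steps of two routine instances, A and B, in a chosen order. All names are hypothetical.

```python
def run(schedule):
    """Execute (instance, step) pairs in order against one shared counter."""
    counter = 0
    regs = {}  # private register of each routine instance
    for who, step in schedule:
        if step == "load":
            regs[who] = counter
        elif step == "inc":
            regs[who] += 1
        elif step == "store":
            counter = regs[who]
    return counter


STEPS = ["load", "inc", "store"]

# B runs only after A has finished: both increments are kept.
sequential = [("A", s) for s in STEPS] + [("B", s) for s in STEPS]

# A is interrupted after incrementing but before storing; B runs to
# completion, then A resumes and overwrites B's update.
interrupted = ([("A", "load"), ("A", "inc")]
               + [("B", s) for s in STEPS]
               + [("A", "store")])
```

Here run(sequential) yields 2 while run(interrupted) yields 1, which is the lost update described in the text.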

Further  data  thrashing  is  possible  during  a  context  switch.  At  a  context  switch,  the  current 
state  of  the  processor  is  saved  so  that  when  that  process  returns  to  execution,  it  is  started  from 
the  point  where  it  was  swapped  out.  This  current  status  is  usually  represented  by  data  such  as  the 
registers  and  allocation  tables.  In  a  threaded  program,  however,  there  may  be  data  that  is  visible  to 
all  processes  and  not  stored  at  the  context  switch.  If  this  data  is  relevant  to  the  state  of  a  particular 
process,  it  must  be  explicitly  defined  as  such.  For  instance,  one  process  sets  a  variable  in  the  global 
data.  This  data  is  carried  over  a  context  switch  and  is  now  visible  to  the  next  process,  where  it  may 
or  may  not  affect  its  execution.  If  the  communication  is  intentional,  care  must  be  used  so  that  a 
context  switch  performed  in  the  act  of  setting  the  variable  will  not  disrupt  the  execution.  For  this 
reason,  the  scope  of  data  should  be  kept  as  local  as  possible,  and  any  global  data  must  be  protected. 

Re-entrance is normally achieved through synchronization. Each time a particular function
is  entered,  it  must  determine  if  it  is  unique  or  if  there  are  other  instances  of  that  function  in 
mid  execution.  This  is  accomplished  by  a  semaphore  or  other  form  of  signal  which  is  visible  to 
all  instances  of  every  function.  Such  global  data  can  be  used  to  coordinate  the  activities  of  each 
function,  the  actual  implementation  depending  on  the  desired  effect.  For  the  synchronization  to  be 
effective,  it  must  be  an  atomic  operation.  The  two  acts  of  checking  the  semaphore  and  setting  it  if 
it  is  not  already  set  cannot  be  interrupted,  otherwise  synchronization  may  be  lost.  For  example,  a 
process  checks  the  signal  and  determines  that  it  is  the  first  instance  of  that  analysis  function.  Before 
it can set the signal, however, an interrupt occurs and the function is called again. This instance also
checks  the  signal  and  determines  that  it  is  the  first,  conflicting  with  the  legitimate  first  instance. 
Normal  instructions  do  not  provide  this  capability,  as  an  interrupt  may  quite  easily  occur  between 
testing  and  changing  a  variable.  Instead,  particular  commands  must  be  used,  which  will  depend  on 
the  platform  used. 
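The difference between a separate test and set and an atomic test-and-set can be sketched in user-level Python. This is an illustration under stated assumptions, not the mechanism available inside the kernel: the non-blocking acquire of threading.Lock stands in for whatever atomic primitive the platform provides, and the names Guard and broken_interleaving are hypothetical.

```python
import threading


def broken_interleaving():
    """Replay the failure case: the test and the set are separate steps."""
    flag = False
    first_saw_clear = not flag    # instance 1 tests the flag ...
    second_saw_clear = not flag   # ... is interrupted; instance 2 tests it too
    flag = True                   # instance 2 sets the flag
    flag = True                   # instance 1 resumes and sets it again
    return first_saw_clear, second_saw_clear  # both believe they are first


class Guard:
    """Test and set performed as one indivisible operation."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_enter(self):
        # Non-blocking acquire: returns True for exactly one caller
        # until leave() is called, and False for every other caller.
        return self._lock.acquire(blocking=False)

    def leave(self):
        self._lock.release()
```

In the broken version both instances conclude that they are first. With the guard, the second call to try_enter returns False until the first caller leaves.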

The  task  of  making  analysis  routines  re-entrant  is  further  complicated  by  the  fact  that  the 
analysis  routines  are  being  executed  within  the  kernel.  There  are  many  libraries  of  thread  control 
and synchronization routines such as pthread.h, semaphore.h, signal.h, and others, but these
are mostly services provided by the kernel, not available within the kernel. To make the analysis
routines  fully  re-entrant,  it  will  be  necessary  to  incorporate  the  same  synchronization  used  within 
the  kernel,  which  is  not  well  documented. 

In some cases the error introduced by data corruption is small enough that it can be tolerated.
In other cases, contrived re-entrance can be incorporated with basic programming to ensure
some  protection.  For  a  detailed  analysis  of  a  multithreaded  program  such  as  the  operating  system, 
however,  full  re-entrance  will  be  required.  This  problem  has  not  been  addressed  before,  and  will 
require substantial investigation before it is adequately resolved.

3.3.6  Reference  Stream  Accuracy 

The  threaded  nature  of  the  operating  system  also  raises  accuracy  concerns.  Through  testing, 
it  has  been  determined  that  there  is  no  duplication  of  kernel  software  similar  to  that  used  for  shared 
libraries  in  single  process  simulation.  This  means  that  if  the  analysis  routine  in  the  test  program 
makes  a  system  call  or  instigates  an  interrupt,  then  the  instrumented  kernel  service  routine  is 
executed.  This  in  turn  generates  additional  references  for  the  simulation  which  would  not  have  been 
generated in the untraced version of the program. This is a significant concern, particularly if the
execution  of  the  operating  system  is  to  be  analyzed  in  detail.  Since  all  real-time  interrupt  routines 
are instrumented, they generate additional references as well, since there are proportionately more
interrupts  per  program  execution  time.  To  counter  this,  there  must  be  an  explicit  mechanism  to 
determine  the  cause  of  the  operating  system  references  and  disregard  the  additional  references  — 
possibly  something  to  incorporate  as  an  aspect  of  the  re-entrance  mechanism. 

3.3.7  Portability 

The  final  area  of  concern  is  ATOM’s  portability.  One  criticism  of  many  of  the  past  methods 
was  their  lack  of  portability.  Some  are  custom  tools,  and  many  were  tied  to  a  specific  architecture  or 
program.  It  is  unfortunate  that  ATOM  is  no  exception.  ATOM  has  only  been  implemented  for  the 
DEC  Alpha  workstations  and  the  operating  system  aspect  can  only  be  used  with  DEC  OSF/1.  The 
one  advantage  ATOM  does  have  is  its  flexibility.  Since  it  is  a  generic  framework  based  on  software, 
that  framework  can  be  reconstructed  for  other  platforms  or  operating  systems.  The  tools  already 
created  can  then  be  used  to  compare  results  across  systems.  Because  of  this  it  is  hoped  that  one 
day  ATOM  will  be  available  for  other  systems,  which  is  entirely  possible. 

4  Test  Methodology

4.1  Cache  Model 


Fundamentally,  a  cache  is  simply  a  device  used  to  store  subsets  of  a  large  data  pool  for 
quick  access.  This  type  of  structure  may  be  found  in  a  TLB  [49],  memory  mapping  tables  [52],  or 
within  an  instruction  pipeline  [27].  The  most  common  form,  and  that  which  is  modeled  here,  is  a 
memory  cache  used  to  improve  average  memory  access  times  by  storing  data  mapped  in  from  main 
memory.  The  design  and  execution  of  such  caches  have  been  rigorously  studied,  and  are  described 
in  a  variety  of  sources  [22,  36,  52]. 

The goal for this research was to develop a flexible cache simulator that incorporates reference
streams from multiple processes, including the operating system. This was built on the
framework  outlined  in  the  previous  section,  using  a  common  data  structure  in  the  kernel’s  address 
space to provide synchronization and store the cache state. The test program mapped this structure
into the program’s address space by accessing the /dev/mem facility, so all test programs must
be executed as root (a moot point in single user mode). To perform a single process simulation for
comparison, the code was slightly modified so that the cache data was local to the test program
and external communication and synchronization were no longer necessary. The code used is provided
in  appendix  A,  but  a  summary  of  the  most  significant  characteristics  is  provided  below. 

The  default  ATOM  tools  only  incorporate  one  test  program  and  the  operating  system.  By 
using  the  same  technique,  however,  it  is  possible  to  extend  a  simulation  to  an  arbitrary  number  of 
programs.  Each  program  simply  maps  the  same  kernel  data  structure  into  its  space  via  a  pointer  so 
each  process  now  has  access  to  the  same  common  memory  structure.  In  this  way,  simulations  can 
be  conducted  with  multiple  test  programs  with  the  operating  system. 
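The effect of several processes mapping one structure can be illustrated without root access. In the sketch below a temporary file stands in for /dev/mem, and the layout, a single 64-bit counter at offset 0, is a hypothetical stand-in for the real structure given in appendix A. Two independent mappings of the same backing object observe each other's writes, which is the property the simulator relies on.

```python
import mmap
import struct
import tempfile

COUNTER_FMT = "<Q"             # hypothetical layout: one 64-bit counter
REGION_BYTES = mmap.PAGESIZE   # size of the shared region


def demo_shared_mapping():
    with tempfile.TemporaryFile() as backing:
        backing.write(b"\x00" * REGION_BYTES)
        backing.flush()
        # Two mappings of the same region, as two test programs would hold.
        view_a = mmap.mmap(backing.fileno(), REGION_BYTES)
        view_b = mmap.mmap(backing.fileno(), REGION_BYTES)
        try:
            struct.pack_into(COUNTER_FMT, view_a, 0, 41)          # A writes
            seen = struct.unpack_from(COUNTER_FMT, view_b, 0)[0]  # B reads 41
            struct.pack_into(COUNTER_FMT, view_b, 0, seen + 1)    # B updates
            return struct.unpack_from(COUNTER_FMT, view_a, 0)[0]  # A sees it
        finally:
            view_a.close()
            view_b.close()
```

The read-modify-write in the middle of this sketch is unprotected, so it is subject to the re-entrance concerns of the previous section.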

For  simplicity,  the  various  analysis  files  were  implemented  as  custom  ATOM  tools.  This 
allowed the use of shared library functions such as math.h within the analysis functions, and
simplified the act of instrumenting each test program. The tools defined for this research are:

kexe  This specified the kernel instrumentation and analysis programs with the ATOM flags necessary
to produce an executable version of the kernel.

kdbg  Kdbg  also  specified  the  kernel  instrumentation  and  analysis  programs,  but  with  the  ATOM 
flags  required  to  produce  the  debug  version  of  the  kernel  used  to  map  memory  addresses. 

user#  The final tool was used for the test programs. The # symbol represents a digit, 1, 2, or 3,
which  identifies  which  test  program  is  being  instrumented.  The  only  difference  is  the  process 
identification  number  assigned. 

The  program  captures  both  instruction  and  data  references  to  be  able  to  model  both  split 
and  unified  instruction  and  data  caches.  This  is  relatively  simple  for  a  RISC  architecture;  each 
instruction  generates  one  instruction  reference,  and  all  data  references  are  one  of  two  possibilities,  a 
data  load  or  data  store.  Instrumenting  every  instruction  generates  too  large  a  kernel  to  be  executed 
on  our  system.  Instead,  instructions  are  instrumented  within  basic  blocks  in  groups  of  8  or  less. 
This  both  decreases  the  size  of  the  programs,  and  speeds  their  execution.  The  processing  routine  is 
passed  the  initial  address  and  the  number  of  instructions  that  follow  to  simplify  processing.  With 
this  information,  the  addresses  of  each  instruction  can  be  recreated  and  processed.  It  is  also  possible 
to instrument only each basic block, but grouping instructions presents a problem. To simulate a
unified  cache,  the  interleaving  of  instruction  and  data  references  in  the  same  stream  is  required. 
If  instructions  are  instrumented  in  groups,  the  actual  interleaving  cannot  be  reconstructed.  Data 
references  could  be  out  of  place  by  as  many  references  as  the  number  of  instructions  grouped  together. 
For  this  reason,  instructions  should  be  instrumented  individually  if  possible.  Using  smaller  blocks  of 
instructions  minimizes  this  error,  and  also  allows  another  simplification  in  processing.  If  the  groups 
of  instructions  are  smaller  than  the  cache  block  size,  then  only  one  reference  need  be  processed  for 
the entire group and the reference counter incremented by the group size. A small margin of error
is  introduced  because  of  the  assumption  that  instructions  are  aligned  along  blocks,  but  this  will  be 
minimal  as  block  size  increases.  This  was  used  in  the  simulator,  limiting  the  minimum  cache  block 
size  to  32  bytes  given  a  4  byte  instruction. 
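The group handling described above can be sketched as follows. This is a minimal illustration, assuming a fixed 4 byte instruction as stated in the text. The function names are hypothetical and do not correspond to the routines in appendix A.

```python
INSTR_BYTES = 4  # fixed-width instruction, as in the text


def expand_group(start_addr, count):
    """Recreate the address of every instruction in an instrumented group."""
    return [start_addr + i * INSTR_BYTES for i in range(count)]


def group_as_single_reference(start_addr, count, block_bytes):
    """Shortcut: look up only the block holding the first instruction and
    advance the reference counter by the group size. Exact only when the
    group does not straddle a block boundary."""
    block_addr = start_addr & ~(block_bytes - 1)
    return block_addr, count


def blocks_touched(start_addr, count, block_bytes):
    """Exact set of blocks the group touches, to expose the shortcut's error."""
    mask = ~(block_bytes - 1)
    return sorted({addr & mask for addr in expand_group(start_addr, count)})
```

A group of 8 instructions starting at 0x1000 lies entirely within one 32 byte block, so the shortcut is exact. The same group starting at 0x1010 touches the blocks at 0x1000 and 0x1020, and the shortcut would miss the second block. This is the alignment error noted in the text.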

Each  reference  is  applied  to  its  appropriate  cache  according  to  the  cache’s  characteristics. 
The  caches  themselves  are  defined  by  4  or  7  parameters,  depending  on  cache  type: 

Type  Either  split,  containing  separate  instruction  and  data  caches  (type  =  1),  or  unified,  having  a 
single  cache  for  both  types  of  references  (type  =  0). 

Cache Size  The cache size in number of bytes. The size is specified as an area, so that the number
of cache lines in a given cache is determined by:

\[ \text{cache lines} = \frac{\text{cache size}}{\text{block size} \times \text{associativity}} \]

Cache size is specified independently for each section of a split cache, as are the last two
parameters.


Block  size  The  size  in  bytes  of  a  cache  block,  which  is  the  unit  of  transfer  between  the  cache  and 
memory. 

Associativity  The  number  of  blocks  per  cache  line. 
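
As a small illustration of how these parameters relate, the sketch below computes the number of cache lines from the size, block size, and associativity. The class name CacheParams is hypothetical. Note that a "cache line" in this text holds several blocks, so it corresponds to what is often called a set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheParams:
    size_bytes: int      # total cache size
    block_bytes: int     # unit of transfer between cache and memory
    associativity: int   # blocks per cache line

    def num_lines(self):
        """Cache size divided by (block size * associativity)."""
        lines, rem = divmod(self.size_bytes,
                            self.block_bytes * self.associativity)
        if rem or lines == 0:
            raise ValueError("size must be a positive multiple of "
                             "block size * associativity")
        return lines
```

For example, an 8192 byte 2 way associative cache with 64 byte blocks has 64 lines, a 4096 byte direct mapped cache with 32 byte blocks has 128 lines, and a fully associative 2048 byte cache with 32 byte blocks (associativity 64) has a single line.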

For most simulations of this type, such parameters must be statically defined during compilation,
which makes repeated tests with a range of parameters difficult. This is because the kernel cannot
access file I/O, so simulation data cannot be loaded when the program starts. This program instead
defines  maximum  parameters  during  compilation  and  memory  is  allocated  for  a  worst  case  condition. 
When  the  operating  system  is  started,  the  simulation  also  starts  but  with  a  flag  so  that  all  references 
are  discarded.  When  the  first  test  program  is  executed,  it  loads  the  desired  cache  parameters  from 
a  file  and  stores  them  into  the  cache  structure,  thereby  allowing  dynamic  definition  of  simulation 
parameters.  Once  this  is  completed,  reference  capture  is  enabled  and  the  simulation  commences. 
This  also  speeds  up  the  operating  system  when  a  simulation  is  not  actually  being  performed,  since 
after  all  test  programs  have  completed  the  flag  is  restored  and  the  simulation  portion  disabled. 
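
The start-up sequence can be sketched as a small state object. This is a simplified model under stated assumptions: the constant MAX_LINES, the class SimState, and the "key=value" parameter text are all hypothetical, and the actual file format and structure layout are those of appendix A.

```python
MAX_LINES = 4096  # hypothetical compile-time worst case


class SimState:
    def __init__(self):
        self.capture = False             # while False, references are discarded
        self.lines = [None] * MAX_LINES  # storage reserved for the largest cache
        self.num_lines = 0
        self.references = 0

    def configure(self, text):
        """Called by the first test program with its parameter file contents."""
        p = {k: int(v) for k, v in (item.split("=") for item in text.split())}
        num_lines = p["size"] // (p["block"] * p["assoc"])
        if not 0 < num_lines <= MAX_LINES:
            raise ValueError("parameters exceed the compiled-in maximum")
        self.num_lines = num_lines
        self.capture = True              # enable reference capture

    def reference(self, addr):
        """Called for every instrumented reference; True if it was simulated."""
        if not self.capture:
            return False                 # cheap early exit while idle
        self.references += 1
        return True

    def finish(self):
        """Called when the last test program completes."""
        self.capture = False
```

References made before configure or after finish return immediately, which models both the discarding of boot-time references and the reduced overhead when no simulation is running.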

Other  cache  characteristics  are  constant.  These  are  programmed  into  the  simulation  and 
cannot  be  modified  without  code  changes: 

•  The  various  threads  encompassing  the  kernel  are  treated  collectively  as  a  single  process. 

•  Caches  are  virtually  addressed.  A  process  identifier  is  associated  with  each  cache  block  to 
identify  its  owning  process,  so  cache  flushes  on  context  switches  are  not  necessary.  This 
neglects  aliases,  or  multiple  virtual  addresses  to  the  same  physical  location,  but  the  effect  of 
such  shared  data  should  be  minimal  given  the  test  programs  used.  If  multiple  threads  of  a 
single  process  such  as  the  kernel  are  to  be  considered,  however,  this  cannot  be  ignored.  Using 
virtual  addresses  drastically  simplifies  the  simulation,  since  no  translation  to  physical  addresses 
is  necessary,  but  it  does  have  a  drawback.  The  virtual  addresses  for  a  program  will  depend 
on  the  system  executing  it  and  how  it  has  been  mapped  from  memory.  This  mapping  may  be 
optimized  for  a  particular  memory  system  or  the  current  execution  environment,  and  so  skew 
the  results  of  a  simulation  of  a  different  system  on  the  same  addresses.  This  must  be  accepted 
unless  the  virtual/physical  mapping  is  also  considered  in  the  model,  which  is  not  a  simple  task. 
Since the effect will be consistent across all programs and caches in the simulation, its impact
is  ignored. 

•  No prefetching is incorporated into the simulation (a policy also called demand fetching). This is
not particularly realistic, since prefetching is a simple but powerful enhancement to cache
performance,  but  for  an  initial  test  of  the  simulation  capability,  it  becomes  an  unnecessary 
complication. 

•  All references are assumed to be the same size, accessing a single byte. This is acceptable
assuming  that  any  words  addressed  do  not  cross  cache  block  boundaries. 

•  Mapping  of  addresses  to  cache  lines  is  by  a  simple  masking  of  the  low  order  address  bits.  This 
is the simplest and most common form, although other hashing algorithms are possible.

•  An  allocate  on  write  policy  is  used,  so  data  writes  are  treated  the  same  as  reads.  This  is 
generally  the  most  pessimistic  write  policy,  as  opposed  to  its  opposite,  no  fetch  on  write,  in 
which  a  data  write  miss  is  ignored  by  the  cache  and  sent  directly  to  memory  [29].  Write  back 
versus  write  through  considerations  are  ignored,  as  the  model  does  not  consider  traffic  to  main 
memory. 

•  Set  associative  caches  use  a  least  recently  used  (LRU)  replacement  algorithm. 
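
Taken together, the fixed characteristics listed above describe a compact model. The following sketch is one possible reading of that list and is not the code of appendix A: each block is tagged with its owning process, the line is selected by masking low order bits of the block number, every reference type allocates on a miss, and replacement within a line is LRU. All names are hypothetical.

```python
from collections import OrderedDict


class Cache:
    def __init__(self, size_bytes, block_bytes, associativity):
        self.block_bytes = block_bytes
        self.assoc = associativity
        self.num_lines = size_bytes // (block_bytes * associativity)
        if self.num_lines < 1 or self.num_lines & (self.num_lines - 1):
            raise ValueError("masking requires a power-of-two number of lines")
        # One ordered map per line; the oldest entry is least recently used.
        self.lines = [OrderedDict() for _ in range(self.num_lines)]
        self.refs = 0
        self.misses = 0

    def access(self, pid, addr):
        """Reads, writes, and fetches are treated alike. True on a hit."""
        self.refs += 1
        block = addr // self.block_bytes
        line = self.lines[block & (self.num_lines - 1)]  # mask low order bits
        key = (pid, block)       # virtual address tagged with owning process
        if key in line:
            line.move_to_end(key)       # mark as most recently used
            return True
        self.misses += 1
        if len(line) >= self.assoc:
            line.popitem(last=False)    # evict the least recently used block
        line[key] = True
        return False
```

Because the process identifier is part of the tag, the same virtual address referenced by two processes occupies two distinct blocks, and no flush is needed at a context switch.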

Cache  performance  is  recorded  as  reference  and  miss  totals  for  each  type  of  reference.  Totals 
are  generated  separately  for  each  process  for  each  cache.  Values  are  reported  at  the  end  of  the 
simulation;  for  multiple  processes  at  the  end  of  each  process.  Process  overwrite  data  is  also  captured, 
in  the  form  of  the  total  number  of  overwrites  by  each  process  over  each  of  the  other  processes.  This 
is  accumulated  by  incrementing  a  particular  counter  identifying  the  previous  and  present  owning 
process  for  each  cache  block  overwritten.  Cache  performance  information  for  the  operating  system 
is  only  captured  during  the  execution  of  test  programs.  References  before  or  after  the  program  are 
ignored. 
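
The overwrite bookkeeping can be sketched with a direct mapped cache. This is an illustration only. The function name is hypothetical, and counting a process overwriting its own block on the diagonal of the matrix is an assumption of the sketch rather than a documented property of the simulator.

```python
def count_overwrites(refs, num_lines, block_bytes, num_procs):
    """refs is a sequence of (pid, addr) pairs. Returns a matrix in which
    overwrite[prev][cur] counts blocks owned by process prev that were
    replaced by process cur."""
    owner = [None] * num_lines   # (pid, block) resident in each line
    overwrite = [[0] * num_procs for _ in range(num_procs)]
    for pid, addr in refs:
        block = addr // block_bytes
        idx = block & (num_lines - 1)
        resident = owner[idx]
        if resident == (pid, block):
            continue                           # hit: nothing is replaced
        if resident is not None:
            overwrite[resident[0]][pid] += 1   # previous owner, present owner
        owner[idx] = (pid, block)
    return overwrite
```

For the reference sequence (0, 0), (1, 0), (0, 0), (0, 128) on a 4 line cache with 32 byte blocks, process 1 overwrites process 0 once, process 0 overwrites process 1 once, and process 0 overwrites its own block once.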

One concern was that in a multiprocess environment, execution is non-deterministic. Because
of this, multiple executions cannot be used to evaluate multiple caches, as there will be
differences  between  each  execution.  To  counter  this,  multiple  caches  with  varying  characteristics  are 
simulated  during  a  single  execution.  This  way,  cache  performance  can  be  compared  across  equivalent 
loading.  It  does  slow  down  execution,  but  accomplishes  more  with  one  run. 
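
A minimal sketch of this arrangement follows, using a deliberately simple direct mapped model so that the example stays short. The class and function names are hypothetical. Each reference is applied to every cache configuration before the next reference is considered, so all configurations see exactly the same interleaving.

```python
class DirectMappedCache:
    def __init__(self, size_bytes, block_bytes):
        self.block_bytes = block_bytes
        self.num_lines = size_bytes // block_bytes
        self.resident = [None] * self.num_lines
        self.misses = 0

    def access(self, pid, addr):
        block = addr // self.block_bytes
        idx = block % self.num_lines
        if self.resident[idx] != (pid, block):
            self.misses += 1
            self.resident[idx] = (pid, block)


def simulate_all(refs, caches):
    """Apply one reference stream to several caches in a single pass."""
    for pid, addr in refs:
        for cache in caches:
            cache.access(pid, addr)
    return [cache.misses for cache in caches]
```

For a stream that alternates between addresses 0 and 64, a 64 byte cache with 32 byte blocks misses on every reference because both addresses map to the same line, while a 128 byte cache misses only twice.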

Another concern was the threaded character of the operating system analysis, for which some
form of re-entrance was required. To address this, a flag is set upon entry to the ATOM analysis
routines.  The  flag  is  a  global  variable  visible  to  all  of  the  executing  processes,  so  can  be  used  for 
synchronization.  If  an  analysis  routine  encounters  the  flag  already  set  on  entry,  it  immediately 
exits,  maintaining  data  integrity.  By  assuming  that  the  reference  which  called  the  analysis  routine 
was in some way instigated by another analysis routine, this also prevents interrupts generated by
the  analysis  routine  from  contributing  to  the  simulation  reference  stream.  It  does  cause  any  other 
interrupts  which  occur  during  analysis  processing  to  be  neglected  as  well.  While  this  may  seem  like  a 
disadvantage,  such  real-time  interrupts  are  normally  skewed  by  the  slowed  processing,  so  neglecting 
a  portion  of  them  is  actually  beneficial.  This  implementation  is  not  ideal,  because  the  flag  is  not  set 
or  cleared  as  an  atomic  operation.  The  majority  of  signaling  and  synchronization  protocols  available 
in  programming  are  actually  services  provided  by  the  kernel,  and  therefore  not  available  to  code  that 
is  executing  within  the  kernel.  If  an  interrupt  occurs  in  the  process  of  checking  or  setting  the  flag, 
the  execution  is  undetermined.  This  was  particularly  a  problem  during  context  switches,  so  another 
mechanism  was  added.  Not  only  do  the  analysis  routines  check  the  signaling  flag,  but  they  also  check 
to  see  if  a  context  switch  has  occurred.  If  a  context  switch  has  occurred,  the  flag  is  automatically 
reset.  This  is  obviously  a  very  improvised  strategy  and  has  much  room  for  improvement,  but  it  was 
effective in regulating the reference stream enough to allow reasonably accurate simulations.
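
The behavior of this guard can be modelled as follows. The sketch is an approximation under stated assumptions: how the analysis routines actually detect a context switch is not shown in this section, so the model simply passes in a hypothetical switch_count that changes whenever a context switch has occurred. The nested argument models an interrupt arriving while the routine is running.

```python
class AnalysisGuard:
    def __init__(self):
        self.busy = False       # global flag visible to every analysis call
        self.seen_switches = 0  # context switch count at the previous call
        self.processed = 0
        self.dropped = 0

    def analysis(self, switch_count, nested=None):
        # A context switch since the last call may have left the flag set
        # by a routine that never finished, so the flag is reset.
        if switch_count != self.seen_switches:
            self.seen_switches = switch_count
            self.busy = False
        if self.busy:
            self.dropped += 1   # re-entered: exit at once, reference ignored
            return False
        self.busy = True        # not atomic, as noted in the text
        self.processed += 1
        if nested is not None:
            nested()            # an interrupt arrives during the analysis
        self.busy = False
        return True
```

A nested call made while the flag is set is dropped, and a stale flag is cleared at the first call after a context switch.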

Other  aspects  of  the  code  were  dictated  by  the  use  of  ATOM.  As  mentioned  in  the  previous 
section,  all  processing  was  kept  to  a  minimum.  Loops  were  used  sparingly,  and  no  function  calls 
beyond  the  original  analysis  routine  were  used.  This  is  not  particularly  good  software  engineering 
practice, but necessary. The hardclock scaling mentioned was also incorporated, with a 90% reduction
in the number of hardclock calls. Even with these measures, the instrumented operating system
was  slow  enough  that  it  was  also  necessary  to  perform  all  simulations  in  single  user  mode.  Multiple 
processes  could  still  be  used  by  executing  them  in  background  mode. 

The  program  developed  is  a  very  comprehensive  and  flexible  simulator  with  a  great  deal  of 
potential,  but  it  does  have  some  problems  discovered  in  hindsight  that  should  be  addressed  in  future 
work. 


•  Program  size  is  still  a  concern;  more  memory  is  definitely  needed  to  reduce  paging  for  more 
accurate  simulations.  Increasing  memory  should  also  improve  execution  times. 

•  Program speed is also still a concern. Ideally, the scheduler should have been modified so
that  instrumented  programs  use  a  longer  maximum  execution  interval  to  accommodate  their 
decreased speed, as done in [8].

•  The  block  replacement  data  showing  process  overwrites  is  not  distinguished  by  reference  types. 
This  is  an  oversight  and  limits  the  potential  usefulness  of  the  data,  as  it  is  impossible  to 
determine  the  contribution  of  each  type  of  reference  to  the  amount  of  interference. 

•  Using  virtual  addressing  is  simplistic  and  raises  other  issues.  Physical  based  addressing  should 
be  used  if  possible. 

•  The impact of the existing memory system and architecture is not considered; it is simply assumed
to be consistent and neglected.

•  The  methods  used  to  correct  timing  problems,  such  as  scaling  hardclock  interrupts  and  ignoring 
interrupts  during  analysis,  are  not  verified.  An  extensive  analysis  should  be  conducted  to 
demonstrate  or  refute  their  effectiveness. 

•  The  synchronization  used  is  very  fragile.  Ideally  the  synchronization  method  used  within  the 
kernel  should  be  studied  and  incorporated  so  that  the  analysis  code  is  truly  re-entrant.  This 
is  particularly  necessary  for  more  reliable  analysis  of  threaded  programs. 

Even  with  these  potential  problem  areas,  however,  the  program  was  capable  of  performing  most 
of  the  desired  simulations,  and  provided  an  adequate  validation  of  the  multi-process  capability  of 
ATOM. 

4.2  Verification 

To  have  any  confidence  in  the  results  of  a  simulation,  the  simulator  must  first  be  verified 
to  ensure  that  it  does  indeed  produce  accurate  results.  The  developmental  nature  of  this  project 
precluded  a  direct  comparison  with  other  equivalent  work.  Default  tools  are  provided  with  ATOM 
which  can  incorporate  the  operating  system,  but  do  not  have  the  flexibility  to  verify  the  range  of 
cache  types  that  will  be  simulated.  Other  tools  are  not  readily  available  to  generate  comparable 
simulations.  Instead,  a  multi  step  approach  was  used  to  demonstrate  the  program’s  correctness. 

The first concern was the ability of the program to accurately capture the address traces.
This  was  accomplished  by  writing  a  second  ATOM  based  application  that  simply  captured  traces 
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without  performing  any  other  processing.  The  references  it  captured  were  compared  to  those  cap¬ 
tured  by  the  simulator,  which  were  identical.  The  second  ATOM  tool  was  simple  enough  that  it 
could  be  verified  by  inspection,  so  if  it  does  not  capture  the  address  traces  correctly  then  any  flaw 
is  within  the  ATOM  framework  and  cannot  be  addressed  here. 

The  next  aspect  to  be  verified  was  the  processing  of  the  reference  stream.  The  program  was 
slightly  modified  so  that  as  each  reference  was  processed,  it  was  also  stored  to  file.  A  trace  file  was 
generated  for  the  following  four  benchmarks: 

•  Compress 

•  Ear 

•  Espresso 

•  SC 

for  the  three  caches  shown: 

•  Unified  8192  byte  2  way  associative  cache  with  64  byte  blocks 

•  Split  2048  byte  fully  associative  caches  with  32  byte  blocks 

•  Split  4096  byte  direct  mapped  caches  with  32  byte  blocks 

The  trace  file  was  then  used  as  input  to  the  DineroIII  cache  simulator  to  test  the  cache  processing. 
DineroIII  and  simulation  results  were  identical  for  all  12  cases. 
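
The hand-off to DineroIII can be pictured with a short sketch. It assumes the traditional "din" input format, in which each record is an access-type label (0 for a data read, 1 for a data write, 2 for an instruction fetch) followed by a hexadecimal address. The function and variable names are hypothetical and are not taken from the simulator described here, and the sample addresses are invented.

```python
# Sketch only: emit references in the Dinero "din" text format.
# Assumed format: "<label> <hex address>" per line, with
# 0 = data read, 1 = data write, 2 = instruction fetch.
import io

DIN_LABELS = {"read": 0, "write": 1, "ifetch": 2}


def format_din_record(kind, address):
    """Return one din record for a reference of the given kind."""
    return f"{DIN_LABELS[kind]} {address:x}"


def write_din_trace(references, out):
    """Write an iterable of (kind, address) pairs to a file-like object."""
    for kind, address in references:
        out.write(format_din_record(kind, address) + "\n")


if __name__ == "__main__":
    # Hypothetical three-reference trace, not data from the benchmarks.
    sample = [("ifetch", 0x120001000), ("read", 0x140002F80), ("write", 0x140002F88)]
    buffer = io.StringIO()
    write_din_trace(sample, buffer)
    print(buffer.getvalue(), end="")
```

Feeding the same reference stream to both simulators in this way means any disagreement in miss counts can be attributed to the cache model rather than to trace capture.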

A further test was used to ensure the simulation program executed correctly. The results of
single process simulations were compared to the results of benchmark cache analysis in other papers
[25, 45]. The cache performance was roughly the same, in that the same general behavior patterns
were present, but there were some differences. These are primarily due to differences in the inputs
used; in some cases the previous research simulated alternate inputs, or combinations of inputs,
different from those used here. Their results were also generated from optimized code which disregarded
shared library references. For our tests, code was not optimized and all references were captured, so
the difference is to be expected.

The final concern regarding the simulator was its repeatability. Given the threaded environment,
results could vary within a single execution. Given the non-deterministic environment, results
could also vary over multiple executions, so an experiment was conducted to determine the extent of
the possible variation. The same three caches mentioned above were simulated for Compress, Ear,
and Espresso 5 times each in succession. Each simulation modeled ten identical caches. The first
results showed that not only did performance vary, but so did the reference load. Each successive
execution of the same program after the initial execution had a reduced number of references from
the kernel. Upon reflection, we realized that this was due to the overhead required for the first
execution of loading the program into memory. All following executions had reduced operating system
overhead since the test program was already in memory, as can be seen in Figure 2.

Figure 2: Operating System Instruction Fetches Over Repeated Program Execution

To eliminate this factor, the tests were repeated without having each program executed
sequentially. The variation was reduced, but not eliminated. For complete accuracy, the system
was rebooted between all later simulations. The second set of results highlighted another problem.
In the output file, the operating system references varied even during the process of recording the
results to file. Figure 3 shows the number of kernel instruction references for ten identical caches
from the same simulation. The increasing number of references for the later caches supports the
point made in the previous section: in the operating system environment, ATOM does not
correctly distinguish between calls to common code made from the test and analysis sections of the
program.

The variation within a single simulation was also due to the threaded nature of the analysis,
so the pseudo re-entrance measures discussed above were then incorporated into the program. They
eliminated the majority of the operating system references generated by the simulation routines, and
also prevented most of the data thrashing. The simulations were again repeated, although only
for the Espresso benchmark and only for 2 split caches, fully associative and direct mapped. These
results showed no variation at all within a single execution, and only a minor variation of 0.01 to 0.1
in the cache miss rates between different executions. Even before these measures were taken, the worst
variation was substantially less than was expected; using single user mode for execution
limits the number of extraneous processes and greatly reduces the non-determinism of execution.
With the additional precautions, we are confident in the accuracy of the simulation results.

Figure 3: Operating System Instruction Fetches Within Same Program Execution

4.3  Simulations 

4.3.1  Platform  Information 

The  described  tests  were  performed  on  a  DEC  Alpha  3000  model  300,  a  RISC  based  AXP 
architecture.  The  root  partition  had  to  be  expanded  to  85MB  to  accommodate  the  larger  kernels 
used,  which  could  contain  up  to  a  48MB  test  kernel  in  addition  to  the  normal  root  residents.  The 
swap  space  was  originally  195MB  which  proved  to  be  insufficient  to  instrument  large  programs.  A 
second  local  disk  was  added  increasing  the  swap  space  to  323MB.  The  usr  partition  was  694MB  which 
was  generally  adequate  although  more  space  was  useful  at  some  points.  The  added  disk  included 
a  1090MB  scratch  directory  which  proved  to  be  invaluable  in  storing  results,  traces,  kernels,  and 
other  files.  The  critical  factor  was  memory.  The  system  only  had  64MB  of  main  memory,  so  during 
simulations  only  about  15MB  of  memory  was  available  for  test  programs.  For  future  efforts,  the 
memory  must  be  increased  to  improve  simulation  performance  and  accuracy. 

The operating system used was the DEC OSF/1 version 3.2A Unix kernel. Newer versions are
available, but this version was sufficient for these tests. The ATOM tool used was version 2.20.
ATOM is also being continuously updated; research was begun with version 2.13, although the system was
upgraded to version 2.20 before simulations were performed. Each new version of ATOM usually
addresses shortcomings of past versions, particularly in terms of intrusiveness, and refines the newer
capabilities, such as instrumenting the kernel, so the most current version available should be used
for future work. The test programs used are from the SPEC 92 benchmark suite. These programs
tend to focus on technical, as opposed to commercial, applications. They are more computation
intensive than other potential test programs, but are also readily available and a standard test tool.

4.3.2  Test  Parameters 

Simulations were performed capturing cache miss rates for program execution alone, programs
with the operating system, and multiple programs executed concurrently. The four benchmarks
used for these simulations were [74]:

Compress  The compress benchmark is the same program as the Unix compress utility. It is a
CPU intensive integer benchmark which compresses an input file using the Lempel-Ziv data
compression algorithm. It has a greater I/O content than the other benchmarks, so it is more
sensitive to the system and execution environment. Due to its nature, the program has a
repetitive instruction reference stream with a drastically less localized data reference stream.
A 1MB input file named in was used with the following command line:

# compress -f -c in > /dev/null

which causes the utility to route the compressed data to stdout instead of back to the original
file; the output is then discarded. This was done so that the execution of the benchmark did not affect
the input file, which was useful during repeated executions. As part of the benchmark
suite, the test calls for multiple iterations of compress, but for our tests only a single execution
is performed to reduce simulation time. The goal of this research is not to benchmark the
system used, so the full tests were not required.

GCC  GCC is the GNU C compiler, and is the most complex benchmark used. As a compiler, the
parsing, organization, and optimization performed produce a highly irregular reference stream.
Some I/O is performed, as well as a variety of other system calls, and the execution depends
heavily on the system used. The compiler was executed by:

#gcc -O -quiet stmt.i -o stmt

which caused it to optimize the source code and suppress any output. Again, the benchmark
suite called for compilation of multiple programs, but only the single input stmt.i was
used for simplicity. One note regarding the instrumentation of gcc: it requires certain
ATOM flags that the other three benchmarks do not. The ATOM command line to be used with
gcc is:

#atom gcc.rr -tool userl -heapbase 50000 -32addr

These  are  required  for  ATOM  to  correctly  instrument  gcc,  as  the  compiler  uses  a  wider  range 
of  the  address  space  and  a  larger  heap  segment  of  memory. 

Espresso  Espresso is a tool for generating and optimizing Programmable Logic Arrays. Its
primary task is minimizing Boolean functions, so it also has a repetitive instruction stream, with
a more localized data stream than compress. It uses very few operating system services, and
is a small program (before tracing), so it normally requires little paging. The benchmark was
used with the tial.in input file with suppressed output as shown below:

#espresso tial.in > /dev/null

As with the other programs, the actual benchmark entails multiple input files, but only this one
was used for testing.

Alvinn  Alvinn stands for Autonomous Land Vehicle in a Neural Network, and represents a neural
network control system capable of taking data from a video camera and laser range finder and
generating control data for an automated vehicle. The benchmark is a single precision floating
point program which trains the network through backpropagation over 200 input epochs. It
performs minimal I/O, but uses the floating point unit extensively. It is repetitive,
although with a much more complex structure than Compress. The command line used was
simply:

#backprop  >  /dev/null 

which activates the training model with the input files h_o_w.txt, i_h_w.txt, in_pats.txt,
and out_pats.txt residing in the test directory. The results of the training for each epoch
are the only output, which is discarded.

Each simulation was performed as described in the previous sections using an input file
of 40 caches of various configurations. Table 1 assigns a number to each cache which is used for
later identification, and shows the different characteristics of each. Only lower associativities are
used to minimize the amount of looping in processing. Other characteristics are arbitrary selections
over a general range, with a limit of 512 lines per cache to minimize storage. The results of these
simulations are discussed in the next section.
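
The 512-line limit can be checked against the configurations listed in Table 1. The sketch below assumes that a "line" means one cache index (set), so that the line count is the cache size divided by the product of block size and associativity. That reading is an interpretation of the text, and the function and constant names are hypothetical.

```python
from itertools import product

LINE_LIMIT = 512  # limit stated in the text

# Unified configurations of Table 1: (cache size, block size, associativity).
UNIFIED = [(8192, 64, 2), (16384, 64, 2), (32768, 64, 2), (65536, 64, 2)]

# Split configurations of Table 1: (cache size, block sizes used at that size).
SPLIT_SIZES = [
    (4096, (32, 64, 128)),
    (8192, (32, 64, 128)),
    (16384, (32, 64, 128)),
    (32768, (64, 128, 256)),
]
ASSOCIATIVITIES = (1, 2, 4)


def num_lines(cache_size, block_size, assoc):
    """Number of indices (sets), assuming 'line' means one cache index."""
    return cache_size // (block_size * assoc)


def table1_configs():
    """Return the 40 geometries in Table 1 order (IDs 0 through 39)."""
    configs = list(UNIFIED)
    for size, blocks in SPLIT_SIZES:
        for block, assoc in product(blocks, ASSOCIATIVITIES):
            configs.append((size, block, assoc))
    return configs


if __name__ == "__main__":
    for cache_id, (size, block, assoc) in enumerate(table1_configs()):
        lines = num_lines(size, block, assoc)
        assert lines <= LINE_LIMIT
        print(cache_id, size, block, assoc, lines)
```

Under this reading, the 32,768 byte caches start at 64 byte blocks because a 32 byte block would give a direct mapped cache 1,024 indices, which exceeds the limit.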
             Unified or Instruction                  Data
ID  Type  Cache Size  Block Size  Assoc  Cache Size  Block Size  Assoc
 0     0       8,192          64      2          NA          NA     NA
 1     0      16,384          64      2          NA          NA     NA
 2     0      32,768          64      2          NA          NA     NA
 3     0      65,536          64      2          NA          NA     NA
 4     1       4,096          32      1       4,096          32      1
 5     1       4,096          32      2       4,096          32      2
 6     1       4,096          32      4       4,096          32      4
 7     1       4,096          64      1       4,096          64      1
 8     1       4,096          64      2       4,096          64      2
 9     1       4,096          64      4       4,096          64      4
10     1       4,096         128      1       4,096         128      1
11     1       4,096         128      2       4,096         128      2
12     1       4,096         128      4       4,096         128      4
13     1       8,192          32      1       8,192          32      1
14     1       8,192          32      2       8,192          32      2
15     1       8,192          32      4       8,192          32      4
16     1       8,192          64      1       8,192          64      1
17     1       8,192          64      2       8,192          64      2
18     1       8,192          64      4       8,192          64      4
19     1       8,192         128      1       8,192         128      1
20     1       8,192         128      2       8,192         128      2
21     1       8,192         128      4       8,192         128      4
22     1      16,384          32      1      16,384          32      1
23     1      16,384          32      2      16,384          32      2
24     1      16,384          32      4      16,384          32      4
25     1      16,384          64      1      16,384          64      1
26     1      16,384          64      2      16,384          64      2
27     1      16,384          64      4      16,384          64      4
28     1      16,384         128      1      16,384         128      1
29     1      16,384         128      2      16,384         128      2
30     1      16,384         128      4      16,384         128      4
31     1      32,768          64      1      32,768          64      1
32     1      32,768          64      2      32,768          64      2
33     1      32,768          64      4      32,768          64      4
34     1      32,768         128      1      32,768         128      1
35     1      32,768         128      2      32,768         128      2
36     1      32,768         128      4      32,768         128      4
37     1      32,768         256      1      32,768         256      1
38     1      32,768         256      2      32,768         256      2
39     1      32,768         256      4      32,768         256      4

Table 1: Simulated Cache Parameters

5 Simulation Results


Simulations of caches with varying types, cache sizes, associativities, and block sizes, as
described in Table 1, were performed with the 4 benchmarks. The data generated by the simulations
has been analyzed by focusing on four aspects of the cache behavior: the change in
cache workload, the change in cache performance for a specific process, the interference generated
between the processes, and the net change in cache performance over all processes. Other areas of
possible exploration include studying performance differences between data reads and writes, and a
detailed characterization of the operating system performance. In some instances only a portion of
the available data is shown in figures. Tables of all results are provided in Appendix B.

5.1  Cache  Workload 

Before looking at the cache performance, it is important to understand how introducing
the operating system and additional processes affects the memory reference stream. The first set
of simulations establishes a baseline by recording the cache’s performance for each benchmark alone.
The frequency of each type of reference is presented in Table 2.


Benchmark    Instruction Fetches     Data Reads  Data Writes     Total Data  Total References
Compress              87,045,943     22,412,017    8,521,660     30,933,677       117,979,620
GCC                  160,240,141      [missing]    [missing]     69,272,173       229,512,314
Espresso             977,787,923    225,779,346   59,867,420    285,646,766     1,263,434,689
Alvinn             5,233,222,111  1,415,013,652  487,428,474      [missing]         [missing]

Table 2: Benchmark References


The  second  set  of  simulations  used  the  same  benchmarks,  but  included  the  operating  system. 
The  frequency  of  each  type  of  reference  is  shown  in  Table  3  for  each  process.  There  is  some  variation 
in  the  number  of  references  for  each  benchmark  due  to  execution  differences,  but  it  is  minimal.  Hello 
World  was  used  for  some  of  the  basic  program  testing,  and  is  included  as  a  curiosity.  For  the  other 
benchmarks,  the  operating  system  overhead  was  generally  small,  less  than  15%  of  the  total  number 
of  references.  For  a  small  program  such  as  Hello  World,  however,  the  operating  system  overhead 
becomes  the  dominant  source  of  memory  references,  totally  overshadowing  the  program. 

The amount of overhead introduced by the operating system is smaller than expected. This
is because the tests were performed in single user mode, and a majority of the operating system
routines were not being executed. In this context, processes such as network and printer controllers,
and the variety of other background system processes, are considered to be part of the ‘operating
system’. One test using ps in multi-user mode showed over 40 different processes being executed,
only one of which was actually a user program. For these system processes to be included, they
must also be instrumented. During the simulations performed, the operating system references are
generally just the overhead required by the test programs.


Benchmark    Instruction Fetches     Data Reads  Data Writes     Total Data  Total References
Hello World                1,247            207          135            342             1,589
OS                       337,491         84,403       51,332        135,735           473,226
Total                    338,738         84,610       51,467        136,077           474,815

Compress              87,045,969     22,412,010    8,521,661     30,933,671       117,979,640
OS                     5,567,602      1,518,924      802,242      2,321,166         7,888,768
Total                 92,613,571     23,930,934    9,323,903     33,254,837       125,868,408

GCC                  160,240,175     50,197,333   19,074,845     69,272,178       229,512,353
OS                    18,705,569      5,130,601    2,613,506      7,744,107        26,449,676
Total                178,945,744     55,327,934   21,688,351     77,016,285       255,962,029

Espresso             977,787,899    225,779,331   59,867,421    285,646,752     1,263,434,651
OS                    29,093,428      9,107,479    3,585,537     12,693,016        41,786,444
Total              1,006,881,327    234,886,810   63,452,958    298,339,768     1,305,221,095

Alvinn             5,233,222,045  1,415,013,630  487,428,474  1,902,442,104     7,135,664,149
OS                   197,365,478     60,413,211   25,986,851     86,400,062       283,765,540
Total              5,430,587,523  1,475,426,841  513,415,325  1,988,842,166     7,419,429,689

Table 3: Benchmark with Operating System References


The operating system overhead will vary depending on the nature of the program, but for
these benchmarks it remains fairly consistent. The percent of the total references which are generated
by the kernel is shown in Figure 4, and ranges from 2.89 to 12.05 percent. This can also be
viewed as the percent increase in number of references as seen in Figure 5, which has a similar
range. For the benchmarks used, the program references still dominate. The benchmarks which
require minimal resources and I/O (Espresso and Alvinn) are the least affected by the addition of
the operating system. Compress is also fairly simple, but requires a larger amount of I/O, hence its
greater overhead. A complex program such as the GCC compiler is affected the most. The amount
of overhead found in these results is less than that found in past studies [1, 2]. Agarwal found the
operating system could increase the number of instructions by 5-75%, but that result was for an older
CISC architecture. Both studies did show that complex programs, such as compilers, are the most
affected.
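
The quoted range can be reproduced from Table 3. The sketch below computes, for each benchmark and reference type, the share of total references generated by the kernel. The dictionary layout and function names are hypothetical; the counts are copied from the OS and Total rows of Table 3, and Hello World is left out because it is not one of the four benchmarks.

```python
# (OS count, Total count) per reference type, copied from Table 3.
TABLE3 = {
    "Compress": {"ifetch": (5_567_602, 92_613_571),
                 "read": (1_518_924, 23_930_934),
                 "write": (802_242, 9_323_903)},
    "GCC": {"ifetch": (18_705_569, 178_945_744),
            "read": (5_130_601, 55_327_934),
            "write": (2_613_506, 21_688_351)},
    "Espresso": {"ifetch": (29_093_428, 1_006_881_327),
                 "read": (9_107_479, 234_886_810),
                 "write": (3_585_537, 63_452_958)},
    "Alvinn": {"ifetch": (197_365_478, 5_430_587_523),
               "read": (60_413_211, 1_475_426_841),
               "write": (25_986_851, 513_415_325)},
}


def kernel_share(os_refs, total_refs):
    """Percent of total references that were generated by the kernel."""
    return 100.0 * os_refs / total_refs


def all_kernel_shares(table):
    """Map (benchmark, reference type) to the kernel's percent share."""
    return {
        (bench, kind): kernel_share(os_refs, total)
        for bench, kinds in table.items()
        for kind, (os_refs, total) in kinds.items()
    }


if __name__ == "__main__":
    shares = all_kernel_shares(TABLE3)
    for key, value in sorted(shares.items(), key=lambda item: item[1]):
        print(key, round(value, 2))
```

The smallest share belongs to Espresso instruction fetches and the largest to GCC data writes, which matches the end points of the range given in the text.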

Figure 6 shows the relative distribution of each reference type within the workload for both
the program and its operating system overhead. Both the program and operating system references
have about the same distribution, with roughly 70% instruction fetches. This is consistent with
[8]. The small proportion of data writes explains the seemingly larger change seen in the previous
two figures: there are relatively fewer data writes, so a smaller change generates a larger percent
difference.

[Figures 4 and 5: captions not recovered. Axis labels: % References From Kernel; % Increase In References]

Figure 6: Distribution of Reference Types (legend: data writes, data reads, instruction fetches)

The final set of simulations was performed executing two benchmarks concurrently, capturing
references from each and the operating system. Results were logged after each test program
completed. The first report contains the information of interest, the cache performance with two
competing user programs. The second report includes the period of time after the first process had
completed, so only a single user process was executing during part of its tracing period. Since this
analysis focuses on the effects of multiple processes, the second report has been discarded. For this
reason, the data shown in Table 4 omits a portion of the execution of the longer process in each
case. Any future references to these simulations also refer specifically to the cache performance at
the end of the first program.

One fact that is not visible from this table is that when both programs have completed, the
cumulative operating system overhead (measured in number of references) is greater than the sum of
the overhead for each program individually, as shown in Table 5. The sum of the operating system
references generated when the benchmarks are executed separately (the first column)
is less than the number of operating system references generated when the same two
benchmarks are executed concurrently (the second column). This highlights the increased operating
system activity required to switch between multiple processes, roughly a 20-40% increase.

Benchmarks   Instruction Fetches     Data Reads  Data Writes     Total Data  Total References
Compress              87,045,885     22,411,994    8,521,651     30,933,645       117,979,530
GCC                   68,021,687     21,218,807    8,094,452     29,313,259        97,334,946
OS                    28,102,411      7,468,658    4,160,003     11,628,661        39,731,072
Total                183,169,983     51,099,459   20,776,106     71,875,565       255,045,548

Compress              87,045,885      [missing]    8,521,651     30,933,645       117,979,530
Espresso              99,475,944      [missing]    4,659,787     28,940,609       128,416,553
OS                    15,541,809      [missing]    2,247,254      6,558,122        22,099,931
Total                202,063,638      [missing]   15,428,692     66,432,376       268,496,014

GCC                  160,240,175     50,197,333    [missing]     69,272,178       229,512,353
Espresso             224,015,827     51,131,704    [missing]     63,229,622       287,245,449
OS                    39,004,710     10,758,087    [missing]     16,350,661        55,355,371
Total                423,260,712    112,087,124    [missing]    148,852,461       572,113,173

Table 4: Concurrent Benchmarks with Operating System References


Benchmarks         Sum of Individual Overheads  Concurrent Overhead
Compress/GCC                        34,338,444           47,433,154
Compress/Espresso                   49,675,212           59,365,363
GCC/Espresso                        68,236,120           89,030,467

Table 5: System Overhead Comparison
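
The 20-40% figure follows from Table 5. The sketch below computes the relative increase in operating system references when two benchmarks run concurrently rather than separately. The names are hypothetical and the counts are copied from Table 5.

```python
# (sum of individual overheads, concurrent overhead), copied from Table 5.
TABLE5 = {
    "Compress/GCC": (34_338_444, 47_433_154),
    "Compress/Espresso": (49_675_212, 59_365_363),
    "GCC/Espresso": (68_236_120, 89_030_467),
}


def overhead_increase(separate, concurrent):
    """Percent increase in OS references caused by running concurrently."""
    return 100.0 * (concurrent - separate) / separate


if __name__ == "__main__":
    for pair, (separate, concurrent) in TABLE5.items():
        print(pair, round(overhead_increase(separate, concurrent), 1))
```

The three pairs come out at roughly 38%, 20%, and 30%, with the pairing that includes both Compress and GCC showing the largest increase.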


A problem arose when certain programs (or combinations of programs) were traced, generating
the trap: invalid memory access error mentioned previously. It is somehow related to
the  size  or  length  of  the  test  programs.  Benchmarks  such  as  Xlisp  (9,561,089,165  references)  and 
Ear  (17,375,158,291  references)  would  crash  the  platform  if  simulated  with  the  operating  system. 
Similarly,  executing  any  of  the  three  smaller  benchmarks  concurrently  with  Alvinn  would  crash  the 
system,  as  well  as  any  three  programs  in  combination.  While  this  problem  limited  the  scope  of  the 
simulations,  correcting  it  was  beyond  the  purview  of  this  research. 

5.2  Impact  on  Process  Performance 

The  simplest  way  to  visualize  the  impact  of  the  operating  system  and  additional  processes  is 
to  measure  their  effect  on  the  cache  performance  for  a  particular  program’s  reference  stream.  Figures 
7  through  14  show  the  cache  miss  rates  for  benchmark  references  only,  for  each  of  the  4  benchmarks. 
The  baseline  is  the  result  from  the  single  process  cache  simulation.  The  other  sets  of  results  are 
essentially  the  same  reference  stream  but  with  transient  misses.  Any  performance  changes  are  due 
strictly  to  these  transient  effects. 

The single process results exhibit normal cache behavior. As expected, increasing cache
size decreases miss rate. A larger cache can contain more, if not all, of a program’s working set,
thus reducing capacity misses. Also, a larger cache will have fewer memory locations assigned to each line,
potentially reducing conflict misses. Increasing associativity also decreases miss rates, although with
diminishing returns; the improvement from A=2 to A=4 is less than the improvement from A=1 to
A=2. Associativity can reduce conflict misses by allowing a line to maintain more than one block at
a time, but the benefits are limited by the number of references to any one line. Since the caches use
a constant area, increasing the associativity decreases the number of possible indices, thus increasing
the stress on a single index. For this reason, in some instances increasing associativity can increase
the miss rate (e.g. Alvinn). Increasing the block size increases the amount of memory fetched on
each miss. This is generally beneficial for instruction references which exhibit spatial locality, but
the reverse may be true for data references. Depending on the benchmark, data miss rates can either
increase (e.g. Compress) or decrease (e.g. Espresso) as block size increases, but this trend is also
related to associativity and other factors. Increasing block size also decreases the number of cache
indices, so again the load on each line is increased, potentially negating any benefits. These results
are comparable to those found in [25, 45, 56].

Comparing the single process results with the other simulations, these trends are not generally
affected. In most cases, the results follow the same patterns but with a noticeable increase
in cache miss rates. The amount of increase may vary by cache or remain relatively constant, depending
on the characteristics of the particular benchmark being considered. This increase is the
error in assuming that cache behavior can be defined by a single process simulation, and shows the
difference between a single program’s cache performance when it is considered alone versus when
it is considered in a multiprocess simulation. As can be seen, the impact of the operating system
is much smaller than that of an additional process. This is logical, considering the operating
system normally executes for shorter durations as it services system calls and interrupts. The impact
of additional processes is generally most pronounced in those caches that already exhibit poor
performance, although this does depend on the benchmark.

It is also interesting to consider the distribution of misses. Figures 7 through 13 show
the percent of misses that were from instruction references. Although
instructions make up the majority of references, they are usually in the minority of misses, as
expected due to their increased locality. For programs such as Compress or Alvinn with a great deal
of spatial locality in their instructions but not data, the loss of locality due to transient interference
is visible in the increased proportion of instruction misses found in the simulations which included
the operating system and additional processes. Other programs such as Espresso may be affected
either way, although data misses still predominate. A more complex program such as GCC has
much less locality in its reference stream, as can be seen by the fact that instructions account for
as much as 65% of its misses. Hence when the additional processes are considered, it is possible for
data cache hit rates to be affected more and the ratio to go down.

5.3  Process  Interference 

Another way to visualize the impact of the additional references is to analyze the proportion
of intrinsic versus extrinsic interference seen by the various test programs. The percentage of misses
attributed to intrinsic interference can be approximated by the percent of misses where the reference
overwrote a block containing information from the same program. The alternative is for the reference
to miss and overwrite another program’s data, highlighting extrinsic interference. A certain number
of references will miss and overwrite invalid data at start up, but these are finite (based on cache
size), and will not significantly affect the percentage. The self overwrite percentage is shown for
each cache for the 4 benchmarks in Figures 19 through 22. When a block is overwritten, no test
is performed to see if the evicted data is live, nor is there a check of the new data to determine
if it has been accessed before, so these figures are not exactly intrinsic interference, but should be
comparable.
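
The self overwrite approximation can be illustrated with a small cache model. This is a sketch under stated assumptions, not the simulator's code: the class and method names are hypothetical, only a direct mapped cache is modeled, and each process is assumed to have a private address space, so a resident block is identified by its tag together with its owner. On a miss, the owner of the evicted block is compared with the process making the reference.

```python
from collections import Counter


class OwnerTrackingCache:
    """Direct mapped cache that records which process loaded each block."""

    def __init__(self, cache_size, block_size):
        self.block_size = block_size
        self.num_lines = cache_size // block_size
        # Each entry is (tag, owner); None marks an invalid line.
        self.lines = [None] * self.num_lines
        # Miss counts keyed by (process, "invalid" | "self" | "external").
        self.misses = Counter()

    def access(self, process, address):
        """Simulate one reference and return True on a hit."""
        block = address // self.block_size
        index = block % self.num_lines
        tag = block // self.num_lines
        entry = self.lines[index]
        if entry == (tag, process):
            return True
        if entry is None:
            kind = "invalid"      # start-up miss into an empty line
        elif entry[1] == process:
            kind = "self"         # evicts a block of the same process
        else:
            kind = "external"     # evicts a block of another process
        self.misses[(process, kind)] += 1
        self.lines[index] = (tag, process)
        return False

    def self_overwrite_percent(self, process):
        """Percent of a process's misses that evicted its own blocks."""
        total = sum(n for (p, _), n in self.misses.items() if p == process)
        if total == 0:
            return 0.0
        return 100.0 * self.misses[(process, "self")] / total


if __name__ == "__main__":
    cache = OwnerTrackingCache(cache_size=64, block_size=16)
    # Hypothetical trace: two processes contending for line 0.
    for proc, addr in [("A", 0x00), ("A", 0x40), ("B", 0x00), ("A", 0x40)]:
        cache.access(proc, addr)
    print(cache.self_overwrite_percent("A"))
```

As in the text, the model does not test whether the evicted block is live or whether the incoming block was referenced before, so the percentage only approximates intrinsic interference.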

The  most  basic  simulation  with  a  single  benchmark  as  input  will  have  100%  of  its  misses 
due  to  internal  considerations,  by  definition.  When  the  operating  system  is  added,  roughly  10-20% 
of  the  misses  are  external  overwrites,  due  to  the  impact  of  the  OS  references.  Adding  an  additional 
process  to  the  simulation  increases  the  external  impact  to  40-70%,  depending  on  the  cache  and 
particular  program.  It  is  unfortunate  that  it  was  not  possible  to  perform  simulations  with  a  greater 
multitasking  level  so  that  a  trend  might  be  visible. 

Smaller  caches  are  affected  more  by  extrinsic  interference  as  expected,  as  are  caches  with 
lower  associativities.  As  each  process  is  executed,  its  references  are  loaded  into  the  cache.  A  smaller 
cache  may  be  totally  overwritten  by  the  new  data,  while  a  larger  cache  may  be  able  to  retain 
a  portion  of  the  previous  program’s  working  set.  Program  characteristics  such  as  the  amount  of 
system  overhead,  as  well  as  working  set  size  and  fluctuation,  affect  the  amount  of  interference,  but 
are more difficult to quantify without an extensive trace analysis.

Figure 9: Process Instruction Reference Miss Rates For GCC

Figure 10: Process Data Reference Miss Rates For GCC

Figure 12: Process Data Reference Miss Rates For Espresso

[Plot legends for the figures above: alone, w/ OS, w/ OS & Compress, and w/ OS and GCC, each at S = 4096, 8192, 16384 and 32768; data reference panels for A = 1, 2 and 4.]

Figure 19: Percent Self Overwritten for Compress (axes: % Misses Self Overwritten versus Cache #)

Figure 22: Percent Self Overwritten for Alvinn

5.4 Impact on Cache Performance

So far this analysis has focused on the cache performance within the context of a single
program.  The  impact  of  the  operating  system  and  additional  processes  is  also  a  factor  when  the 
aggregate  cache  performance  is  considered,  encompassing  all  references  from  the  trace.  These  results 
are  shown  in  Figures  23  through  30,  which  are  organized  identically  to  the  ones  before.  The  single 
process  simulations  for  each  benchmark  are  again  used  as  a  baseline,  with  the  total  cache  performance 
plotted  for  each  simulation  that  involved  that  benchmark.  Results  from  simulations  with  multiple 
processes  are  shown  in  multiple  figures,  but  because  all  references  are  considered,  the  net  cache 
performance  is  the  same  regardless  of  which  process  is  used  as  the  perspective. 

The total miss rate is essentially a weighted average of the miss rates of the component
processes, as shown below:

$$M = \frac{\sum_{p} m_p}{\sum_{p} r_p} \tag{1}$$

where M is the total miss rate, $m_p$ is the number of misses for each process, and $r_p$ is the number of
references for each process. Because it is a weighted average, the behavior of the total miss rate may
be dominated by the miss rate behavior of one of the component processes. A process may dominate
the average because of the number of references it generates, as when a benchmark is combined
with its respective operating system overhead, which has fewer references. A process may also
dominate the average because of its performance. For example, Compress suffers from particularly
poor data cache performance, so any simulation involving Compress will have the average data
cache performance dominated by Compress’ characteristics. Conversely, Compress also has the
lowest instruction cache miss rates, so the average instruction cache performance is dominated by
whatever process is executed with Compress. The dominant process will define the gross performance
characteristics of the overall cache behavior, for instance the miss rate fluctuations as a
parameter such as cache size varies.
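The weighted average can be checked numerically with a short sketch. The function names and the miss and reference counts below are hypothetical values chosen for illustration, not measurements from the simulations.

```python
def total_miss_rate(per_process):
    """Total misses over total references; per_process maps name -> (misses, references)."""
    total_misses = sum(m for m, _ in per_process.values())
    total_refs = sum(r for _, r in per_process.values())
    return total_misses / total_refs


def weighted_average(per_process):
    """The same quantity, written as per-process miss rates weighted by reference share."""
    total_refs = sum(r for _, r in per_process.values())
    return sum((r / total_refs) * (m / r) for m, r in per_process.values())


# Hypothetical counts: a benchmark with many references and a smaller OS component.
counts = {"benchmark": (50_000, 1_000_000), "os": (8_000, 100_000)}
print(total_miss_rate(counts), weighted_average(counts))
```

Both functions return the same value, and the result sits much closer to the benchmark's own miss rate than to the operating system's because the benchmark supplies most of the references.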

The  impact  of  each  benchmark  can  be  seen  by  its  contribution  to  the  total  miss  rate,  but  the 
impact  of  the  operating  system  is  not  as  visible.  Figures  31  and  32  show  the  percent  of  misses  that 
are due to kernel references for instructions and data respectively. As can be seen, the impact on the
data cache is much more consistent than that on the instruction cache. The instruction impact varies
significantly depending on the benchmark in question and the demands it places on the operating
system. Cache design parameters can also be a factor, as the larger caches have a larger portion of
the misses due to the kernel. This is logical, as the programs with their larger footprints can take
advantage of the larger caches, while the operating system with its shorter execution intervals may
never leave the cache warm up phase.

5.5  Summary 

Based  on  the  evidence  shown  here,  a  few  generalizations  can  be  made  about  the  observed  cache 
performance. 

•  Both  operating  system  and  additional  user  processes  will  significantly  affect  cache  performance, 
with  the  user  programs  generating  the  largest  impact. 

•  For  a  given  process,  the  performance  is  always  degraded  due  to  the  external  interference, 
although  if  the  net  performance  over  multiple  processes  is  considered  it  may  be  better  than 
the  performance  for  just  one  of  the  component  processes  due  to  averaging. 

•  The  primary  source  of  this  performance  degradation  is  in  the  loss  of  temporal  locality.  The 
interference  between  the  various  processes  does  not  affect  each  process’  spatial  locality,  but 
with  frequent  interruptions  in  process  execution  there  is  a  loss  of  temporal  locality  across  each 
interruption. 

•  The  worst  degradation  is  in  caches  which  already  suffered  from  poor  performance. 

• The amount of degradation and any patterns it follows depend greatly on the specific processes
involved, and the effects observed can vary greatly. This is due to the differences in program
behavior  such  as  system  demands  (system  calls,  interrupts)  and  footprint  (size,  length,  working 
set). 

•  The  overall  cache  performance  is  an  average  of  the  performance  of  the  component  processes. 
The  individual  process  performance  characteristics  are  interrelated,  so  are  difficult  to  determine 
independently. 

This  is  contrary  to  some  of  the  initial  assumptions  made  in  [1,  2,  3],  which  have  since  been  discarded. 
These results are more comparable to those found in [11, 12, 13].

Figure 23: Instruction Cache Miss Rates With Compress

Figure 24: Data Cache Miss Rates With Compress

Figure 25: Instruction Cache Miss Rates With GCC

Figure 27: Instruction Cache Miss Rates With Espresso

Figure 29: Instruction Cache Miss Rates With Alvinn

Figure 30: Data Cache Miss Rates With Alvinn

Figure 31: Percent Instruction Misses From Kernel

[Plot labels for the figures above: Miss Rate (%) versus Block Size (Bytes), 32 to 256; series: alone, w/ OS, w/ OS & Compress, and w/ OS and GCC, each at S = 4096, 8192, 16384 and 32768.]

5.6 Future Work


With  the  simulations  already  performed,  there  is  still  a  considerable  amount  of  data  analysis 
that  could  be  performed,  as  more  specific  aspects  of  cache  performance  are  considered.  Also,  a 
number  of  improvements  to  the  simulation  program  were  outlined  in  section  4,  which  should  ideally 
be  included  before  any  future  work  is  performed  with  this  tool.  The  most  fundamental  change  should 
be  towards  modeling  more  of  the  memory  system,  to  include  such  aspects  as  traffic  to  memory, 
physical  address  mapping,  write  policies,  and  cache  service  times.  Other  additions  can  be  readily 
made  to  the  cache  simulator  to  study  specific  aspects  of  cache  design,  such  as  alternative  replacement 
algorithms in associative caches, different address hashing algorithms, or prefetching possibilities.

Other  more  substantial  changes  could  be  made  to  generate  different  forms  of  performance 
data.  One  area  is  analyzing  sampled  cache  performance,  looking  at  cache  performance  over  shorter 
time  periods  to  study  the  effects  of  short  term  working  set  changes.  Another  area  is  tracing  the 
operating  system  in  particular,  capturing  data  from  the  various  kernel  threads  separately,  as  well  as 
determining  the  source  of  system  calls.  Another  possibility  is  to  provide  a  more  detailed  reference 
record  so  that  reference  gap  information  is  available  to  study  interference  patterns  in  more  detail. 
On  the  most  generic  level,  such  a  tool  can  also  be  used  to  generate  traces  for  other  work.  Finally, 
this  research  will  provide  the  background  necessary  for  continued  study  of  the  operating  system 
through the development of new ATOM tools.

6 Context Switch Model

6.1  Theory 

With  ATOM,  it  is  now  possible  to  generate  simulations  with  a  broader  scope  than  just  a 
single  process.  As  a  commercially  available  tool  with  a  great  deal  of  flexibility,  ATOM  is  simpler 
to  use  than  past  methods,  but  it  still  requires  a  significant  amount  of  additional  time  and  resources 
to perform the cache analysis. An improvement would be to approximate the accuracy of a
comprehensive simulation without the additional effort. One possible method is to develop a synthetic
model  which  would  generate  complex  traces  without  the  execution  of  programs.  Such  a  technique 
would  exercise  the  entire  cache  like  a  real  environment,  but  is  difficult  to  verify  and  is  beyond  the 
scope  of  this  work. 

A  simpler  method  is  to  study  a  single,  more  focused,  aspect  of  cache  performance.  Here  we 
only  consider  the  performance  of  a  single  process,  but  in  the  context  of  a  multi-process  environment, 
similar  to  that  considered  by  Agarwal  in  [3].  Instead  of  an  entire  synthetic  workload,  an  analytical 
model  can  be  used  in  conjunction  with  a  single  process  trace.  In  this  way,  the  cache  behavior  of 
a  single  process  can  be  predicted  more  accurately  with  only  a  simple  simulation.  The  model  is 
responsible  for  injecting  the  desired  multi-process  characteristics  into  the  simulation,  which  can  be 
achieved  through  a  statistical  approach. 

The  simulation  of  a  single  process  will  identify  its  own  characteristics,  and  the  introduction 
of  the  statistical  model  will  incorporate  the  transient  effects  of  a  complex  environment.  This  can 
be  achieved  by  analyzing  the  effect  of  the  operating  system  and  additional  processes  on  a  single 
process,  and  mimicking  this  in  the  simulation  program.  As  will  be  seen,  this  is  essentially  modeling 
context  switch  characteristics  in  the  cache  [31,  41,  56].  Though  it  will  not  be  as  accurate  as  the  full 
simulation,  it  will  be  faster  and  much  easier  to  execute.  For  an  approximate  result,  it  is  much  more 
efficient. 

From  the  perspective  of  a  single  process,  it  is  the  sole  user  of  the  cache  at  any  given  point 
in  time  (assuming  a  uniprocessor  environment).  However,  the  time  the  process  is  actually  being 
executed  is  not  continuous  for  its  entire  lifetime.  The  process  is  instead  broken  up  into  shorter 
continuous  segments  separated  by  context  switches.  Between  these  segments,  operating  system 
routines  or  other  processes  are  being  executed,  which  can  overwrite  some  or  all  of  the  process’  cache 
blocks. Assuming all the various processes are independent, these interruptions are transparent to
any single process and each process is not “aware” of the other processes being executed. Here the
term interruption is used to denote the time from when a given program is switched out of execution
to the point it is returned to execution. The net effect on the cache is that, from a specific program’s
perspective, it is executed continuously, but at certain times during its execution some or all of its
cache blocks are overwritten or invalidated. Figure 33 shows the difference between this perspective
and the actual environment, showing a basic time space diagram of process execution. This would
be the condition in a multitasked uniprocessor where each thread or program is considered to be a
unique process with a unique reference stream.

[Figure 33: time space diagram of process execution; surviving labels: Proc 1, Time]


This  would  suggest  that  by  modeling  context  switches,  the  gap  between  single  and  multiple 
process  simulations  can  be  bridged.  There  are  basically  two  fundamental  questions  that  must  be 
addressed  by  such  a  statistical  model: 

1.  how  often  the  execution  of  a  program  is  interrupted  by  a  context  switch,  and 

2.  what  is  the  impact  to  the  cache  state  caused  by  this  interruption. 

These questions are not easily answered. Timing of context switches can depend on many variables
including the physical system state, how the system is loaded, and characteristics of the programs.
Similarly, the impact will depend on the state of program execution, the amount of live data present
in the cache, and the amount of overlap, if any, between the working sets of the various programs. The
model will depend heavily on the particular system involved, and must be developed with the
hardware, operating system, and test programs in mind. Once these factors are understood, they
can be incorporated into the simulation program so that simulations would theoretically provide
results comparable to the program being executed in a realistic environment [23].


6.2  Development 

The  first  step  in  developing  the  model  is  to  ensure  that  it  is  applicable  to  our  test  system 
[17,  39,  65,  69].  Our  Alpha  based  system  meets  the  criteria  described  above.  It  is  a  single  processor 
machine  running  OSF,  which  can  execute  multiple  processes  on  a  timesharing  basis.  Instructions 
and  data  can  be  shared  between  processes,  but  their  dependence  can  be  minimized  by  choosing 
appropriate  test  programs.  The  impact  of  the  test  platform  on  the  traces  is  assumed  to  be  consistent 
across  all  simulations  and  is  ignored.  The  references  generated  are  64  bit  virtual  addresses  in  a 
continuous  address  space,  so  no  adaptation  of  the  simulation  model  is  necessary. 

Understanding  the  operating  system  is  the  most  important  aspect  of  developing  the  model 
[4, 9, 18, 70, 72, 71]. The operating system both generates its own set of references and controls
the scheduling of the other reference streams. The OSF/1 operating system is a threaded collection
of processes which includes system calls, interrupt handlers, and other overhead management/control
routines. These can be modeled simply as a collection of additional processes of varying length that
are executed at random intervals. The processes are switched in and out of execution just like the
test programs. The priority of these processes requires that they be able to occur at any time, preempting
the execution of the test process. The various threads that make up the kernel are not independent,
and may share substantial amounts of data. By considering the threads of the kernel collectively
as the operating system overhead, as was done in the earlier simulations, the model can neglect
this shared data with minimal loss of accuracy. The remaining issue is the degree of data sharing
between  the  program  and  the  operating  system,  which  is  difficult  to  pinpoint.  For  the  purpose  of  this 
model,  this  dependence  is  assumed  to  be  minimal  and  is  neglected,  which  is  a  reasonable  assumption 
for  the  choice  of  benchmarks.  Any  simulation  of  threaded  programs  or  other  programs  which  use 
substantial  cross  process  communication  cannot  use  these  simplifying  assumptions. 

Given  that  this  type  of  model  is  applicable  to  the  simulations  already  performed,  our  next 
task  is  to  analyze  the  system  and  program  characteristics  to  define  the  model’s  structure.  A  context 
switch  mechanism  must  be  introduced  into  the  simulation,  and  the  effects  of  each  interruption  in 
execution incorporated appropriately.

6.3 Implementation

One of the most basic forms of modeling multiprocessing is to totally flush the cache at
regular intervals, modeling the effect of context switches between processes executing in a round
robin fashion [3, 21, 56]. This is realistic for a virtually addressed cache without process identifiers,
and a reasonable approximation for a small cache, where a context switch will probably overwrite
all data, but it is not appropriate for larger caches, where data survival is likely. A more accurate and
versatile model is necessary, but will be more complex.

For  a  model  to  be  effective,  however,  it  cannot  be  so  complex  that  direct  simulation  becomes 
a  better  alternative.  If  a  detailed  description  of  the  test  program  is  required  just  to  develop  the 
model,  then  simulation  may  be  just  as  effective.  It  is  also  important  that  the  model  directly  relates 
to  the  system  it  represents.  In  [31],  a  very  comprehensive  model  is  developed.  Unfortunately,  it 
requires  a  thorough  analysis  of  the  program  trace  to  define  the  model  parameters,  thus  limiting  its 
usefulness. Also, it fails to consider some very basic variations in cache architecture. A balance is
necessary: the model must be complex enough to be accurate, but based on basic properties of the
system  and  programs  that  are  easily  observed.  With  this  in  mind,  the  model  can  be  developed  by 
answering  the  two  questions  mentioned  above. 

6.3.1  Frequency 

The  answer  to  the  first  question  is  based  on  the  execution  interval  of  a  program,  or  how 
long  it  is  executed  before  a  context  switch  occurs.  This  is  heavily  dependent  on  how  execution  is 
scheduled,  which  is  controlled  by  the  operating  system  [19].  A  process  is  executed  until  it  either 
is  switched  voluntarily  (i.e.,  while  it  waits  for  some  system  resource,  or  requests  a  system  call), 
it  is  preempted  by  a  higher  priority  process  (i.e.,  an  interrupt  service  routine),  or  it  is  switched 
involuntarily  for  another  user  process  (i.e.,  the  end  of  a  fixed  time  allocation  is  encountered).  The 
initial  priority  of  a  process  depends  on  its  type  (system  versus  user)  and  its  requirements  (interactive 
versus  compute  intensive).  The  priority  can  degrade  while  the  process  is  being  executed  and  is 
promoted  while  it  is  stalled,  which  prevents  a  single  process  from  dominating  the  system  resources. 
In  a  fixed  priority  scheme,  processes  of  equal  priority  are  processed  according  to  a  policy,  either  first 
in  first  out  (the  program  executes  until  completion)  or  round  robin  (programs  are  switched  after  a 
fixed interval, taking turns) [16, 71]. The time sharing in OSF/1 is on a thread basis; however, the
test programs are all single threaded, and the various threads of the operating system are considered
as a conglomerate from the cache’s perspective.

For  the  model,  we  use  a  basic  scheme  based  on  this  information.  We  assume  that  all 
operating  system  level  processes  have  a  higher  priority  than  any  test  program  process,  so  they  can 
interrupt  test  program  execution  at  any  time.  These  processes  will  include  both  interrupt  service 
routines  and  system  calls.  All  test  programs  run  at  the  same  priority,  with  a  round  robin  scheduling. 
For  a  single  program,  this  defines  the  characteristics  of  its  execution  interval.  The  interval  has  some 
maximum  value  where  a  context  switch  is  automatic,  but  up  to  that  point  there  is  some  probability 
that  a  switch  will  occur  earlier  due  to  either  an  interrupt,  system  call,  or  stall  waiting  for  resources. 
Based  on  results  from  previous  studies  [8,  31,  41],  this  probability  follows  an  exponential  distribution. 
Most processes execute for a short interval, and the exponential fall-off means that very few processes
consume the maximum interval, showing that context switches are a regular occurrence. With
round  robin  scheduling,  the  number  of  test  programs  considered  in  the  model  does  not  affect  the 
execution  interval. 

To  incorporate  this  fact  into  the  model,  a  random  variable  R  is  defined  representing  the 
execution  interval  length  in  number  of  references  r  with  an  exponential  probability  density  function. 
A  distribution  of  this  kind  has  the  form  [53]: 

$$f(r) = \frac{1}{\mu} e^{-r/\mu}, \quad r \ge 0 \tag{2}$$

where $\mu$ is a constant which defines the shape of the curve and its expected value. The probability
that any given reference interval R will be r references or less is defined by:

$$P[R \le r] = \int_{-\infty}^{r} f(r)\,dr = 1 - e^{-r/\mu} \tag{3}$$

If we assume that an interval will be as long as possible, then this can be used as the
probability that a given execution interval R is r references long, expressed as:

$$p = 1 - e^{-r/\mu} \tag{4}$$

This function could be incorporated into the program by determining the probability of a given
interval as that reference is reached. A random number in [0..1] is then generated at each reference
to determine if a switch is necessary. A better solution is to invert the equation to yield:

$$r = -\mu \ln(1 - p) \tag{5}$$

Thus generating a random number in [0..1] will generate an appropriate execution interval length r
(rounded to an integer value), as shown in Figure 34.


Figure  34:  Execution  Interval  Given  Some  Probability  [0..1] 
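
The inversion that turns the cumulative probability into an interval length is short enough to write out. The steps below use only the symbols already defined (p, r, and the shape constant written here as $\mu$) and are a worked restatement rather than new material:

```latex
\begin{align*}
p &= 1 - e^{-r/\mu} \\
e^{-r/\mu} &= 1 - p \\
-\frac{r}{\mu} &= \ln(1 - p) \\
r &= -\mu \ln(1 - p)
\end{align*}
```

Because the logarithm diverges as p approaches 1, a draw of exactly 1 has no finite interval, which is one reason an explicit upper limit on r is needed.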


The remaining unknown is $\mu$, which can be determined by defining the desired maximum
execution interval. In [8, 41] this was 400,000 traced instructions, or 25,000 untraced, although these
values are based on a system that is no longer contemporary. If we assume that each program executes
for a maximum 10 ms time slice on a system with a 20 ns cycle and an average of 2 cycles used per
instruction [71], this generates a maximum interval of 250,000 references:

$$\frac{10 \times 10^{-3}\ \text{seconds/interval}}{(2\ \text{cycles/instruction})(20 \times 10^{-9}\ \text{seconds/cycle})} = 250{,}000 \tag{6}$$

At this point, written $r_{max}$ here, the probability of a context switch defined above should approach 1, or

$$\lim_{r \to r_{max}} e^{-r/\mu} = 0 \tag{7}$$

Obviously this cannot be exact, but selecting a $\mu$ of one fifth of the maximum interval, or 50000, is
accurate to 0.006738 (the value of $e^{-5}$), which is sufficient for this application. Since the exponential
function cannot define the maximum value, an explicit limit is set on the function, so that the final
definition of each execution interval is given by:

$$r = \min(-50000 \ln(1 - p),\ 250000) \tag{8}$$

which is the function used to generate Figure 34.
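
As an illustration, the capped interval function can be written in a few lines. This is a sketch under stated assumptions: the names `MU`, `MAX_INTERVAL`, `execution_interval` and `next_interval` are hypothetical, the constants are the 50000 and 250000 given in the text, and the use of Python's `random` module stands in for whatever generator the simulator used.

```python
import math
import random

MU = 50_000             # shape constant from the text
MAX_INTERVAL = 250_000  # explicit limit on the interval, in references


def execution_interval(p):
    """Map a probability p in [0, 1] to an interval length in references."""
    if p >= 1.0:
        return MAX_INTERVAL            # ln(0) is undefined; use the limit directly
    return min(round(-MU * math.log(1.0 - p)), MAX_INTERVAL)


def next_interval(rng=random):
    """Draw one execution interval; rng.random() returns a value in [0, 1)."""
    return execution_interval(rng.random())
```

A draw of p = 0 gives an interval of zero references, so a caller may want to treat that as an immediate switch or redraw; the text does not say which, so the sketch leaves the value as computed.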

Incorporating  this  into  software,  at  program  start  and  after  every  context  switch,  a  random 
value  is  generated  in  [0..1].  This  is  applied  to  the  above  function  to  determine  the  execution  interval. 
A  counter  is  maintained  of  the  number  of  instruction  references  since  the  last  context  switch,  and 
when  these  two  values  are  equal,  the  switch  impact  model  discussed  below  is  performed.  The  actual 
distribution generated by the random function is shown in Figure 35, which gives the probability of
each specific interval as estimated from 250,000,000 generated intervals. The probability
of any particular interval is low, but the cumulative probability of a context switch as the interval
increases to its maximum value approaches 1 as expected. The spike at 250000 references is due to
the limit in the function, and is negligible in the cumulative distribution.
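
The counter logic described above might be wired into a trace loop as follows. This is only a sketch: the trace format of (kind, address) pairs with kind "I" marking instruction references, the `access` and `switch_impact` callbacks, and the function name are all hypothetical, and the comparison uses "at least" rather than strict equality so that a zero-length draw still triggers a switch.

```python
import math
import random


def simulate(trace, access, switch_impact, rng=None, mu=50_000, max_interval=250_000):
    """Run a single-process trace, invoking switch_impact() at sampled intervals.

    Returns the number of context switches modeled.
    """
    rng = rng or random.Random()

    def draw():
        # Capped exponential interval; rng.random() is in [0, 1), so the log is defined.
        return min(round(-mu * math.log(1.0 - rng.random())), max_interval)

    interval = draw()          # drawn at program start
    since_switch = 0           # instruction references since the last switch
    switches = 0

    for kind, addr in trace:
        access(kind, addr)     # normal cache lookup for this reference
        if kind == "I":
            since_switch += 1
            if since_switch >= interval:
                switch_impact()        # apply the switch impact model
                switches += 1
                interval = draw()      # drawn again after every context switch
                since_switch = 0
    return switches
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which helps when comparing cache configurations against the same sequence of modeled switches.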



6.3.2  Impact 

The  second  question  addresses  the  likelihood  that  data  in  the  cache  is  overwritten  by  the 
processes  executed  during  the  interruption.  As  stated  before,  simply  invalidating  the  entire  cache 
is  not  a  realistic  model.  Instead,  the  model  must  take  into  account  the  footprints  of  all  processes 
executed  during  the  interruption  to  determine  what  portion  of  the  cache  is  overwritten.  This  is 
addressed  by  both  Agarwal  [3]  and  Thiebaut  and  Stone  [56].  Both  models  attempt  to  evaluate  all 
aspects  of  the  cache  analytically.  By  using  simulations,  much  of  the  model  can  be  discarded.  Instead, 
only  the  relevant  function  regarding  the  probability  of  cache  line  replacement  is  used.  Both  papers 
use  identical  functions  to  determine  the  probability  that  a  program’s  working  set  will  have  a  certain 
number of unique references to a given cache line. The derivation of this function is quite lengthy;
for more information please consult either paper. It is based on the binomial probability that any
given  cache  reference  will  be  assigned  to  a  certain  cache  line. 

The  calculation  is  a  function  of  the  number  of  cache  lines  N,  the  cache  associativity  A, 
and  the  footprint  F  of  the  interruption,  defined  as  the  number  of  unique  blocks  referenced  by  the 
program  in  the  interval  under  consideration.  The  probability  that  a  given  cache  line  will  contain  i 
references from a certain footprint is defined as:

if 0 < i < A :    (9)

if i = A :

The  probability  that  a  certain  number  of  blocks  will  be  used  on  any  given  line  directly  determines 
the  probable  number  of  blocks  that  must  be  evicted  from  that  line  during  the  interruption. 
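
For reference, one form consistent with the binomial description, in which each of the F footprint blocks maps independently and uniformly to one of the N lines, would be the following. This is an assumption based on the standard binomial distribution, not a quotation of the expressions in [3, 56]; in particular, letting the first case include i = 0 and folding every count of A or more into the i = A case are assumptions made so that the probabilities sum to 1.

```latex
P_i =
\begin{cases}
\dbinom{F}{i} \left(\dfrac{1}{N}\right)^{i} \left(1 - \dfrac{1}{N}\right)^{F-i}, & 0 \le i < A \\[2ex]
1 - \displaystyle\sum_{j=0}^{A-1} P_j, & i = A
\end{cases}
```

Under this form, a line can lose at most A blocks during an interruption, which matches the limit imposed by the associativity.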

Unfortunately, this function cannot be inverted to give a direct calculation of the number of
blocks overwritten in each line based on a single variable in [0..1]. Instead, a random probability p
is generated for each line in each cache and the following algorithm is used to iterate over all values
of a in the range [0..A-1] to determine the number of overwrites to be performed on that line:

$$\text{if } p > \sum_{i=0}^{a} P_i, \text{ then } a + 1 \text{ overwrites are performed} \tag{12}$$
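
The per-line iteration can be sketched as below. The cumulative test follows the rule stated above; the probabilities themselves are computed from a plain binomial with success probability 1/N, which is an assumption about the form used in [3, 56], and all function names are hypothetical.

```python
import math
import random


def line_probabilities(num_lines, assoc, footprint):
    """Return [P_0, ..., P_A]: assumed binomial terms, with the tail folded into P_A."""
    q = 1.0 / num_lines
    probs = [
        math.comb(footprint, i) * q**i * (1.0 - q) ** (footprint - i)
        for i in range(assoc)
    ]
    probs.append(max(0.0, 1.0 - sum(probs)))   # case i = A
    return probs


def overwrites_for_line(p, probs):
    """For a in 0..A-1, a + 1 overwrites are performed while p exceeds the running sum of P_i."""
    assoc = len(probs) - 1
    overwrites = 0
    cumulative = 0.0
    for a in range(assoc):
        cumulative += probs[a]
        if p > cumulative:
            overwrites = a + 1
        else:
            break              # the running sum only grows, so later tests also fail
    return overwrites


def overwrites_per_switch(num_lines, assoc, footprint, rng=random):
    """Draw one random p per line and return the overwrite count for every line."""
    probs = line_probabilities(num_lines, assoc, footprint)
    return [overwrites_for_line(rng.random(), probs) for _ in range(num_lines)]
```

In a simulator, each returned count would be applied to the corresponding line by evicting that many blocks, least recently used first if LRU replacement is modeled.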

Based on [56], the overwrites caused by this function follow a roughly normal distribution.
Figures 36 and 37 show the probability of n overwrites per line, P(n), for a context switch with
interruption footprints of 100 and 1000 respectively. Various associativities and their possible
replacements are shown, with the replacement probability plotted against the number of lines in the
cache, showing the decreasing likelihood of replacement as cache size increases or footprint size
decreases.

Certain  assumptions  apply  to  the  formulas  provided  in  the  papers.  These  equations  assume 
that  a  program’s  footprint  is  uniformly  distributed  over  the  cache.  The  locality  in  reference  streams 
would  suggest  that  this  is  not  true,  which  was  supported  by  the  results  in  both  papers.  Using 
other  mapping  algorithms  (hashing),  it  may  be  possible  to  get  a  more  uniform  distribution,  but  this 
technique  was  not  used.  Finally,  shared  references  between  programs  are  neglected.  As  discussed 
before,  given  the  test  programs  used  and  the  way  the  kernel  is  considered,  this  is  a  reasonable 
assumption. To analyze a threaded program, or one with a substantial shared component (such as
a  database),  such  an  assumption  is  not  valid. 
[Figures 36 and 37, plots not reproduced; axis label: Percent Chance of Replacement]


Other assumptions made in the papers are no longer relevant. The use of LRU replacement
is  assumed  in  the  analytical  model,  but  incorporated  explicitly  in  simulation.  The  LRU  blocks  are 
selected  for  overwrite,  but  other  selection  methods  are  possible.  Also,  other  considerations  such  as 
which  cache  lines  present  at  a  context  switch  will  be  referenced  after  the  interruption  period  do  not 
have  to  be  modeled,  since  they  are  determined  by  the  simulation. 

The  remaining  problem  is  determining  the  footprint  of  the  interruption.  The  footprint 
depends on the process being considered, its state of execution, and the line size of the cache, so it is
very difficult to characterize. In [3, 56] detailed analyses of program traces were used to determine
this  value.  This  is  not  compatible  with  our  goal  of  minimal  analysis  in  developing  the  model,  so 
a  different,  more  improvised,  approach  is  used.  Based  on  the  footprint  values  used  in  other  work 
[3,  56],  a  reasonable  (though  less  accurate)  range  can  be  achieved  using: 


\[ F_I = \frac{T_{int}}{50 \cdot B} \tag{13} \]

\[ F_D = \frac{T_{int}}{50} \tag{14} \]

which gives the instruction footprint F_I as 2% of the execution interval of the interruption (T_int) divided
by the block size (B) in words (or in bytes divided by 4), and the data footprint F_D as simply 2% of the
execution interval. This is obviously an overly simplified approach to characterizing the footprint,
but it is adequate for an initial review. For a unified cache, the two footprints are simply summed, which
is correct assuming independence of instruction and data references (no self-modifying code). For a
range of intervals [0..250,000], this produces a footprint range of [0..5625] for the caches simulated.
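As a worked check of the footprint rule described above, the following C sketch computes the instruction, data, and unified footprints. The function names are hypothetical, and the 32-byte block size used in the check is an assumption, chosen because it reproduces the quoted upper bound of 5625 for an interval of 250,000.

```c
#include <assert.h>

/* Instruction footprint: 2% of the interval divided by the block size
 * in words (block size in bytes divided by 4). */
double instr_footprint(double t_int, int block_bytes)
{
    assert(block_bytes >= 4);
    return t_int / (50.0 * (block_bytes / 4.0));
}

/* Data footprint: 2% of the interval. */
double data_footprint(double t_int)
{
    return t_int / 50.0;
}

/* Unified cache: the two footprints are summed. */
double unified_footprint(double t_int, int block_bytes)
{
    return instr_footprint(t_int, block_bytes) + data_footprint(t_int);
}
```

For an interval of 250,000 and 32-byte blocks this gives 625 instruction blocks and 5000 data blocks, or 5625 for a unified cache.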

The  execution  interval  of  the  interruption  is  computed  as 

\[ T_{int} = -n \, M \ln(1 - p) \tag{15} \]

where n is the number of additional processes being executed according to the model and p is a
random value in [0..1] as used before. This is consistent with the round robin scheduling, as the
number  of  processes  being  executed  determines  the  length  of  interruption.  One  problem  is  that  the 
models  used  in  both  [3,  56]  neglect  the  operating  system.  For  simplicity,  the  operating  system  is 
modeled  as  just  another  process:  to  simulate  a  process  with  the  operating  system,  n  =  1;  with  the 
operating system and one other process, n = 2; and so on. This may be pessimistic, as one might
expect system calls and interrupt service routines to be shorter than user programs; however,
the distribution of execution intervals is weighted towards shorter intervals, which is consistent with
frequent interruptions.
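The weighting towards shorter intervals follows directly from the form of equation (15). Assuming p is drawn uniformly from [0, 1) and that n and M are positive (assumptions made here for the derivation, since the text does not state the distribution of p), the interval has the cumulative distribution:

```latex
\begin{align*}
\Pr[T_{int} \le t] &= \Pr\left[-nM\ln(1-p) \le t\right] \\
                   &= \Pr\left[p \le 1 - e^{-t/(nM)}\right] \\
                   &= 1 - e^{-t/(nM)}, \qquad t \ge 0 .
\end{align*}
```

Under these assumptions the interval is exponentially distributed with mean nM and median nM ln 2, which is about 0.69 nM, so more than half of the generated intervals are shorter than the mean.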
The impact is applied in software every time a context switch is indicated. The length of
the  interruption  is  computed,  which  in  turn  defines  the  footprint  for  the  various  unified,  instruction, 
and  data  caches.  This  is  used  to  calculate  the  probability  that  a  given  number  of  cache  blocks 
are  overwritten  for  each  cache  line  in  each  different  cache  configuration.  Then  for  each  cache  line 
a  random  number  in  [0..1]  is  generated  and  compared  to  the  probability  to  determine  how  many 
blocks  on  that  line  (up  to  the  set  size)  are  invalidated. 
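The invalidation step for a single cache line might look like the following sketch. It is not the code of appendix A: the type sketch_block, the field last_use (larger meaning more recently used), and the marker SKETCH_INVALID_TASK are hypothetical stand-ins for whatever bookkeeping the simulator keeps per block.

```c
#include <assert.h>

#define SKETCH_INVALID_TASK (-1)   /* hypothetical "no owner" marker */

typedef struct {
    long tag;
    unsigned long last_use;        /* hypothetical: larger = more recent */
    int task;                      /* owning process, or SKETCH_INVALID_TASK */
} sketch_block;

/* Invalidate up to `count` least-recently-used valid blocks in one set.
 * Returns the number of blocks actually invalidated. */
int invalidate_lru(sketch_block *set, int assoc, int count)
{
    int done = 0;
    int w;

    assert(set != 0 && assoc > 0);
    while (done < count) {
        int victim = -1;
        for (w = 0; w < assoc; w++) {
            if (set[w].task == SKETCH_INVALID_TASK)
                continue;          /* already invalid */
            if (victim < 0 || set[w].last_use < set[victim].last_use)
                victim = w;        /* oldest valid block so far */
        }
        if (victim < 0)
            break;                 /* nothing left to invalidate */
        set[victim].task = SKETCH_INVALID_TASK;
        done++;
    }
    return done;
}
```

The count passed in would come from the comparison of the random number against the per-line probabilities; it is capped naturally at the set size because the loop stops when no valid block remains.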

6.4  Testing 

The mechanism described above was incorporated into the same program used for the single
process simulations described in section 5. The additional code is also included in appendix A.
Again a tool (called mod) was defined to instrument the test programs so that shared library functions
could be used in analysis. Simulations with the model were performed using the same 40 caches
on all four benchmarks for n = 1, modeling the program with the operating system. Simulations
were also performed for n = 2 for Compress, GCC, and Espresso, to compare the model results to
simulations of two concurrent processes with the operating system. All simulations were performed
on the same Alpha system as before. The results of the model simulations are reviewed in the next
section, and compared with their equivalent "real" simulations.
7 Model Evaluation

7.1 Individual Results for n=1


The  accuracy  of  the  context  switch  model  can  be  seen  in  its  ability  to  predict  cache  miss 
rates  commensurate  with  those  generated  from  an  equivalent  “real”  simulation.  The  first  test  case 
was for n=1, modeling the test program with one additional process, the operating system, which
was  performed  for  Compress,  GCC,  Espresso,  and  Alvinn.  The  results  of  these  simulations  are 
plotted  against  the  corresponding  real  simulation  of  each  program  with  the  operating  system,  shown 
in  Figures  38  to  41. 

As  can  be  seen,  the  model  generally  provides  an  adequate  mechanism  for  predicting  the 
interference  caused  by  operating  system  overhead.  There  are  some  variations  over  the  results, 
although  certain  instances  such  as  Alvinn  data  references  are  quite  accurate.  Such  variations  are 
to  be  expected  given  the  assumptions  that  were  used  to  generate  the  model.  The  only  significant 
fluctuations  occur  for  Compress,  which  is  logical  considering  that  benchmark  interacts  substantially 
more  with  the  operating  system  than  the  others. 

7.2  Individual  Results  for  n=2 

A better test of the model is for n=2, modeling the effects of the operating system and
an additional process on the performance of the test program. Simulations were performed for
Compress, GCC, and Espresso; Alvinn was neglected since no corresponding real simulation could
be  performed.  These  results  are  shown  in  Figures  42  to  44. 

These  results  show  the  weakness  of  the  model.  In  almost  every  case,  the  model  predictions 
are  more  optimistic  than  the  real  data.  Also,  the  model  does  not  account  for  differences  in  program 
behavior,  so  while  there  are  two  sets  of  real  data  from  two  alternative  second  programs,  the  model 
only  predicts  a  single  result.  Based  on  this,  the  model  does  not  accurately  predict  the  amount  of 
interference  generated  from  multitasking.  The  error  in  the  model  should  also  be  more  pronounced 
as  the  level  of  multitasking  is  increased,  but  no  simulations  could  be  performed  with  3  test  programs 
or  more  to  verify  this. 

7.3  Interference  Comparison 

[Plots not reproduced. Panels: Instruction References, A=1; Instruction References, A=4; Data References, A=1; Data References, A=4. Series for each cache size S = 4096, 8192, 16384, and 32768: w/ OS & Compress, w/ OS & Espresso or w/ OS & GCC, and w/ model.]

The primary source of error in the model is apparent in the interference plots. These are
equivalent to the interference figures of the previous results, showing what percentage of cache misses
overwrote a process' own data (intrinsic interference), as opposed to overwriting another process'
data (extrinsic interference). These plots are shown for each of the seven test cases in Figures 45
through 51.
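The percentage plotted in these figures can be derived from the per-process overwrite counts that the simulator reports. The sketch below assumes the denominator is the sum of all counts, including the invalid-data entry; this matches the sample output in appendix A.2, where the four counts listed for process 0 under cache 0 sum to its 5810982 total misses. The function name is hypothetical.

```c
#include <assert.h>

/* Percentage of a process' misses that overwrote its own blocks.
 * interfere[i] is the number of times the process overwrote blocks owned
 * by process i; the last entry is assumed to count invalid blocks. */
double percent_self_overwritten(const unsigned long *interfere,
                                int entries, int self)
{
    unsigned long total = 0;
    int i;

    assert(interfere != 0 && self >= 0 && self < entries);
    for (i = 0; i < entries; i++)
        total += interfere[i];
    if (total == 0)
        return 0.0;                /* no misses recorded */
    return 100.0 * (double)interfere[self] / (double)total;
}
```

Applied to the counts for process 0 in that example (2614797, 2207422, 988510, and 253), the intrinsic share is just under 45 percent.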

As  can  be  seen,  the  model  underestimates  the  amount  of  extrinsic  interference  present  in  a 
multitasked  situation.  With  a  second  program  in  the  model,  the  primary  source  of  interference  is 
still  intrinsic,  as  seen  by  the  percentage  of  self  overwrites,  which,  based  on  the  previous  results,  is 
inaccurate. The only instances in which the model is even remotely correct are the largest caches for GCC
and Espresso.

Given  the  fact  that  the  operating  system  is  modeled  fairly  accurately,  but  the  impact  for 
other  programs  is  not,  the  most  likely  source  of  error  is  in  the  impact  to  the  cache  at  each  context 
switch. The switch frequency is assumed to be more accurate. This is also supported by the assumptions
used to develop the model. The most likely source of error is the footprint characterization.
Using a simple function of the interruption interval is obviously an oversimplification. A more accurate
model could be developed by using a more flexible model of footprint size and composition
based  on  program  features. 

7.4  Summary 

Based on the above results, the model described in section 6 does not adequately introduce
the impact of context switches into a single process simulation. The interference generated
approaches the level caused by the operating system, but is not significant enough to represent additional
user programs. Given the assumptions used to develop the model, the most likely source
of  error  is  in  the  realization  of  context  switch  impact,  in  particular  the  computation  of  the  program 
footprint.  The  method  used  was  overly  simplified,  especially  the  relationship  between  block  size  and 
program  footprint. 

The  difficulty  of  developing  an  accurate  context  switch  model  highlights  the  complexity  of 
the  cache  environment.  Cache  performance  is  an  intricate  subject,  and  some  aspects  are  not  well 
understood.  Analytical  models  can  facilitate  evaluation,  but  at  the  expense  of  accuracy.  Any  model 
will  have  to  find  a  balance  between  these  two  goals.  The  requirement  for  accuracy  reaffirms  the  need 
for  analysis  tools  as  described  earlier,  despite  their  own  limitations. 
[Figures 45 to 48, plots not reproduced; x-axis: Cache #]

Figure 45: Percent Self Overwritten for Compress; n=1

Figure 46: Percent Self Overwritten for GCC; n=1

Figure 47: Percent Self Overwritten for Espresso; n=1

Figure 48: Percent Self Overwritten for Alvinn; n=1
7.5 Future Work


While  the  model  was  not  particularly  successful  in  predicting  interference,  it  does  provide 
a  theoretical  foundation  for  further  exploration.  As  discussed  above,  the  primary  limitation  is  the 
simplistic treatment of process footprints. Were this resolved, so that the footprints considered both
the program in question and the cache block size, the model should perform much better.

Another potential improvement is a more detailed characterization of the operating system,
to include its various composite threads. Also, the footprint of the operating system processes must
be  considered  differently  than  user  programs,  due  to  their  unique  nature.  The  execution  interval 
function  can  also  be  improved,  by  including  specific  program  characteristics  such  as  the  frequency 
of  system  calls  and  interrupts  generated  by  that  particular  program.  Finally,  additional  aspects  of 
the  various  existing  analytical  models  can  be  incorporated  to  further  simplify  the  simulations.  A 
better  understanding  of  the  execution  environment  will  allow  more  realistic  assumptions  to  be  used 
in  that  case. 
8 Conclusions


The  primary  thrust  of  this  research  was  the  development  and  refinement  of  the  ATOM  based 
simulation  capability  for  a  complex  workload.  This  was  accomplished  through  the  development  of  a 
very  flexible  and  robust  analysis  program.  This  program  is  based  on  standard  simulation  tools,  but 
incorporates  novel  techniques  to  allow  a  more  comprehensive  analysis.  Partially  based  on  the  current 
work  of  others,  many  of  these  techniques  still  required  extensive  test  and  adaptation  before  their 
performance  was  adequate.  Other  areas,  such  as  re-entrant  analysis,  were  totally  original.  Several 
avenues  of  future  work  have  also  been  highlighted,  based  on  developing  this  work  into  an  even  more 
mature  tool. 

The  cache  simulations  were  performed  as  a  demonstration  of  the  overall  potential  of  the 
simulation  capability,  as  well  as  reinforcing  assumptions  about  cache  performance  with  operating 
system  overhead  and  in  the  multiprocess  environment.  The  context  switch  model  attempted  to 
combine  both  empirical  and  theoretical  understanding  of  caches,  and  the  testing  portrayed  a  specific 
application of the ATOM tools created. These results were generally consistent with past endeavors,
although they highlighted some possible deficiencies in current methods and assumptions. The execution
environment is quite complex, and aspects of its behavior are not particularly well understood.
The ATOM tool promises to be a very effective and flexible tool for robust computer architecture
analysis; however, further work is necessary to fully realize its potential.

In  the  final  analysis,  the  consideration  of  cache  miss  rates  must  be  weighed  with  the  impact 
of  those  miss  rates  on  overall  memory  system  performance.  The  actual  goal  of  a  cache  is  to  improve 
memory  access  times.  A  cache  with  a  very  low  miss  rate  but  with  a  slow  access  time  is  just  as  much 
a problem as a cache with a high miss rate but very fast access time. Traffic between the various
levels of the memory hierarchy will also be a factor, as the time to service a miss is also important.
Other  factors  such  as  the  area  and  power  required  for  the  cache  must  also  be  considered  for  an 
accurate  appraisal  of  the  cost  and  benefits  of  incorporating  a  certain  cache  design  into  a  system. 
This  work  has  been  the  first  step  towards  such  appraisals  which  include  a  comprehensive  workload. 
9 Contributions of this Thesis


•  The  majority  of  the  work  described  in  this  thesis  has  revolved  around  developing  the  ATOM 
tracing  capability  for  the  operating  system  and  multiple  user  programs.  Previous  work  in 
this  particular  area  is  almost  non-existent.  ATOM  itself  is  a  well  defined  tool,  but  this  type 
of  implementation  has  not  been  studied  before.  A  general  method  to  instrument  the  kernel 
is  outlined  by  Eustace  and  Chen  in  [20],  but  not  well  explored.  Their  material  was  used  as 
a  foundation,  but  expanded  upon  to  develop  the  next  generation  of  tools.  The  testing  and 
refinement  performed  over  the  past  year  have  made  advances  in  several  areas: 

- The cache simulation tools developed are much more comprehensive than any existing
ATOM programs, providing more flexibility and detailed results.

- The techniques proposed by Eustace and Chen have been extended to include not only
the operating system but also multiple user programs.

- The issue of re-entrant analysis functions was explored for the first time. This will play
a critical role in the exploration of certain applications such as the operating system.

- Other limitations associated with using ATOM on the kernel are now more fully understood.
Some were addressed in this work, while others will require further study to be
completely resolved.

•  The  cache  simulations  served  as  a  validation  of  the  tools  developed.  The  results  confirmed  the 
necessity  for  this  type  of  work,  revealing  the  significance  of  multiprogramming  in  workloads. 
The  data  gathered  has  affirmed  theories  about  cache  performance,  and  can  be  used  to  design 
more  efficient  memory  caches. 

•  The  context  switch  model  attempts  to  combine  both  theoretical  and  empirical  cache  studies  in 
an  effort  to  achieve  a  balance  between  simplicity  and  accuracy.  It  is  an  extension  of  the  basic 
cache  model  which  synthetically  generates  the  impact  of  multiprogramming.  While  not  entirely 
successful,  the  testing  does  highlight  gaps  in  current  understanding  of  cache  performance  in 
a complex environment. This will serve as a background for more appropriate models, which
should  successfully  reduce  simulation  processing. 

•  The  most  significant  aspects  of  this  thesis  are  the  potential  contributions  to  future  work.  With 
the  capability  developed  here,  a  wide  variety  of  additional  cache  studies  are  possible.  With 
some relatively minor modification, the tools developed can be adapted to a wide variety of
program analyses. Most importantly, this work will provide the foundation to allow these
studies to include the operating system, a subject that has not been well addressed in the past.
A Program Source Code

Programs  are  based  primarily  on  the  structure  developed  in  [20]  and  past  work  from  [23,  24]. 
Other  sources  for  information  include  [44,  66,  67,  68,  73,  74].  The  input  and  output  file  formats  are 
shown  first  with  short  examples,  followed  by  the  various  files  and  programs  used.  They  are  provided 
as a reference for future efforts as well as to aid understanding of the material:

1.  Input  Format  and  Example 

2.  Output  Format  and  Example 

3.  Cache  Model  Library 

4.  Kernel  Instrumentation  File 

5.  Kernel  Analysis  File 

6.  Program  Instrumentation  File 

7.  Program  Analysis  File 

8.  Sample  Tool  Description  File 

9.  Context  Switch  Model  Library 

10.  Model  Analysis  File 
A.1 Input Format

The input file must be called cache.in and has the format:

•  (simulation  name) 

•  (number  of  processes  in  simulation) 

• (name of each process) (n-1 names; process 0 is assumed to be the OS)

•  (number  of  caches  in  simulation) 

•  (cache  definitions) 


Names  can  contain  up  to  80  characters.  Cache  definitions  consist  of  two  lines.  The  first  is  a  0  or  1 
denoting  the  cache  type.  The  second  contains  the  cache  parameters  in  the  forms  shown  below  based 
on  cache  type: 

Unified (0): (U cache size) (U block size) (U associativity)

Split (1): (I cache size) (I block size) (I associativity) (D cache size) (D block size) (D associativity)

A short example input file is shown below:

multi process test
3
cc1 -O -quiet stmt.i -o stmt
espresso tial.in > /dev/null
3
0
16384 64 2
1
16384 128 4 16384 128 4
1
32768 256 1 32768 256 1
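Reading one cache definition from this format can be sketched as below. The sketch is illustrative and is not taken from the programs in this appendix: cache_def and parse_cache_def are hypothetical names, with field names chosen to mirror the parameter structure of section A.3. The type line is assumed to have been read already, and the parameter line is passed in as a string.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    int type;       /* 0 = unified, 1 = split */
    int csize[2];   /* [0] = unified/instruction, [1] = data */
    int bsize[2];
    int assoc[2];
} cache_def;

/* Parse the parameter line of one cache definition.
 * Returns 0 on success, -1 if the line does not match the cache type. */
int parse_cache_def(int type, const char *line, cache_def *def)
{
    int n;

    assert(line != 0 && def != 0);
    def->type = type;
    if (type == 0) {
        n = sscanf(line, "%d %d %d",
                   &def->csize[0], &def->bsize[0], &def->assoc[0]);
        if (n != 3)
            return -1;
        def->csize[1] = def->bsize[1] = def->assoc[1] = 0;
        return 0;
    }
    if (type == 1) {
        n = sscanf(line, "%d %d %d %d %d %d",
                   &def->csize[0], &def->bsize[0], &def->assoc[0],
                   &def->csize[1], &def->bsize[1], &def->assoc[1]);
        return (n == 6) ? 0 : -1;
    }
    return -1;      /* unknown cache type */
}
```

Checking the number of converted fields against the cache type catches the most likely input error, a split cache given only three parameters.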
A.2 Output Format

The simulation results were dumped to a file called cache.out. The output format has a
banner page followed by a page of results for each cache. Results are recorded at the end of each
program in the simulation; however, the second set of data was removed from the example for brevity.
The format is self-evident from the example shown below. In hindsight, the output file should have
used a format directly readable by a spreadsheet program. The format below is easy to understand,
but it requires manual entry of data into spreadsheets for analysis.


<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
SIMULATION: multi process test
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
Number Tasks = 3
#0: kernel
#1: cc1 -O -quiet stmt.i -o stmt
#2: espresso tial.in > /dev/null
Number Caches = 3
(type, icsize, ilsize, iassoc, dcsize, dlsize, dassoc)
#0: 0 16384 64 2
#1: 1 16384 128 4 16384 128 4
#2: 1 32768 256 1 32768 256 1

DATA AT END OF PROCESS 1
<><><><><><><><><><><><><><><><><><><><><><><><><>
simulation: multi process test
(data at end of process 1)

CACHE # 0
cache type: 0 (0=unified, 1=split)
icache size: 16384
icache line size: 64
icache associativity: 2
********************
Process #0
Inst  39004710  Miss  2739339  Perc  7.023098
Data  16350661  Miss  3071643  Perc  18.786048
read  10758087  Miss  2366717  Perc  21.999422
writ  5592574  Miss  704926  Perc  12.604679
TOTAL  55355371  Miss  5810982  Perc  10.497594
Interference (number times process 0 overwrote:)
Process 0 = 2614797
Process 1 = 2207422
Process 2 = 988510
Process 3 = 253
(process 3 is invalid data)
********************
Process #1
Inst  160240175  Miss  5166542  Perc  3.224249
Data  69272178  Miss  4512864  Perc  6.514685
read  50197333  Miss  3475694  Perc  6.924061
writ  19074845  Miss  1037170  Perc  5.437371
TOTAL  229512353  Miss  9679406  Perc  4.217379
Interference (number times process 1 overwrote:)
Process 0 = 2175838
Process 1 = 4910549
Process 2 = 2287801
Process 3 = 3
(process 3 is invalid data)
********************
Process #2
Inst  224015943  Miss  1813316  Perc  0.809458
Data  63229661  Miss  3257726  Perc  5.152212
read  51131731  Miss  2778587  Perc  5.434174
writ  12097930  Miss  479139  Perc  3.960504
TOTAL  287245604  Miss  5071042  Perc  1.765403
Interference (number times process 2 overwrote:)
Process 0 = 1020129
Process 1 = 2561443
Process 2 = 1489470
Process 3 = 0
(process 3 is invalid data)
********************
TOTAL FOR CACHE
Inst  423260828  Miss  9719197  Perc  2.296267
Data  148852500  Miss  10842233  Perc  7.283877
read  112087151  Miss  8620998  Perc  7.691335
writ  36765349  Miss  2221235  Perc  6.041654
TOTAL  572113328  Miss  20561430  Perc  3.593944

simulation: multi process test
(data at end of process 1)

CACHE # 1
cache type: 1 (0=unified, 1=split)
icache size: 16384
icache line size: 128
icache associativity: 4
dcache size: 16384
dcache line size: 128
dcache associativity: 4
********************
Process #0
Inst  39028217  Miss  1297351  Perc  3.324136
Data  16360315  Miss  2091714  Perc  12.785292
read  10764480  Miss  1706268  Perc  15.850910
writ  5595835  Miss  385446  Perc  6.888087
TOTAL  55388532  Miss  3389065  Perc  6.118712
Interference (number times process 0 overwrote:)
Process 0 = 1358317
Process 1 = 1358773
Process 2 = 671722
Process 3 = 253
(process 3 is invalid data)
********************
Process #1
Inst  160240175  Miss  2378836  Perc  1.484544
Data  69272178  Miss  2370733  Perc  3.422345
read  50197333  Miss  1965331  Perc  3.915210
writ  19074845  Miss  405402  Perc  2.125323
TOTAL  229512353  Miss  4749569  Perc  2.069418
Interference (number times process 1 overwrote:)
Process 0 = 1356440
Process 1 = 2358083
Process 2 = 1945071
Process 3 = 3
(process 3 is invalid data)
********************
Process #2
Inst  224033574  Miss  652803  Perc  0.291386
Data  63235212  Miss  1542671  Perc  2.439576
read  51136035  Miss  1321124  Perc  2.583548
writ  12099177  Miss  221547  Perc  1.831091
TOTAL  287268786  Miss  2195474  Perc  0.764258
Interference (number times process 2 overwrote:)
Process 0 = 674120
Process 1 = 993262
Process 2 = 488640
Process 3 = 0
(process 3 is invalid data)
********************
TOTAL FOR CACHE
Inst  423301966  Miss  4328990  Perc  1.022672
Data  148867705  Miss  6005118  Perc  4.033862
read  112097848  Miss  4992723  Perc  4.453897
writ  36769857  Miss  1012395  Perc  2.753329
TOTAL  572169671  Miss  10334108  Perc  1.806126

simulation: multi process test
(data at end of process 1)

CACHE # 2
cache type: 1 (0=unified, 1=split)
icache size: 32768
icache line size: 256
icache associativity: 1
dcache size: 32768
dcache line size: 256
dcache associativity: 1
********************
Process #0
Inst  39100207  Miss  877237  Perc  2.243561
Data  16384285  Miss  2191363  Perc  13.374786
read  10780440  Miss  1793502  Perc  16.636631
writ  5603845  Miss  397861  Perc  7.099786
TOTAL  55484492  Miss  3068600  Perc  5.530554
Interference (number times process 0 overwrote:)
Process 0 = 1704283
Process 1 = 946851
Process 2 = 417213
Process 3 = 253
(process 3 is invalid data)
********************
Process #1
Inst  160240175  Miss  1414353  Perc  0.882646
Data  69272178  Miss  2717362  Perc  3.922732
read  50197333  Miss  2261685  Perc  4.505588
writ  19074845  Miss  455677  Perc  2.388890
TOTAL  229512353  Miss  4131715  Perc  1.800215
Interference (number times process 1 overwrote:)
Process 0 = 942089
Process 1 = 2260273
Process 2 = 929350
Process 3 = 3
(process 3 is invalid data)
********************
Process #2
Inst  224033574  Miss  435774  Perc  0.194513
Data  63235212  Miss  2459827  Perc  3.889964
read  51136035  Miss  2205351  Perc  4.312714
writ  12099177  Miss  254476  Perc  2.103250
TOTAL  287268786  Miss  2895601  Perc  1.007976
Interference (number times process 2 overwrote:)
Process 0 = 422012
Process 1 = 924590
Process 2 = 1548999
Process 3 = 0
(process 3 is invalid data)
********************
TOTAL FOR CACHE
Inst  423373956  Miss  2727364  Perc  0.644197
Data  148891675  Miss  7368552  Perc  4.948935
read  112113808  Miss  6260538  Perc  5.584092
writ  36777867  Miss  1108014  Perc  3.012720
TOTAL  572265631  Miss  10095916  Perc  1.764201

DATA AT END OF PROCESS 2
<><><><><><><><><><><><><><><><><><><><><><><><><>

(format repeats for data at end of second process)
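The spreadsheet-readable output suggested above could take the form of one comma-separated row per statistic. The following sketch is purely illustrative and was not part of the simulator: the function format_csv_row and the column order (cache, process, kind, references, misses) are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one result row as "cache,process,kind,refs,misses".
 * Returns the row length, or -1 if the label contains a comma or the
 * buffer is too small. */
int format_csv_row(char *buf, size_t buflen, int cache, int process,
                   const char *kind, unsigned long refs,
                   unsigned long misses)
{
    int n;

    assert(buf != 0 && kind != 0);
    if (strchr(kind, ',') != 0)
        return -1;                 /* keep the row well formed */
    n = snprintf(buf, buflen, "%d,%d,%s,%lu,%lu",
                 cache, process, kind, refs, misses);
    if (n < 0 || (size_t)n >= buflen)
        return -1;                 /* output error or truncation */
    return n;
}
```

Writing the raw counts rather than the percentage leaves the miss rate to be computed in the spreadsheet, which avoids any rounding in the output file.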
A.3 Cache Model Library

The following file, cache.h, was used as a definition/procedure library for the basic cache
simulator:

/* CACHE.H */
/* CACHE SIMULATION LIBRARY */
/* JOHN FRASER */

/* SIMULATION CHARACTERISTICS */

/* MAXIMUM NUMBER OF CACHES IN SIMULATION */
#define MAXCACHES 40

/* MAXIMUM NUMBER OF PROCESSES IN SIMULATION */
#define MAXTASKS 4

/* MAXIMUM NUMBER OF LINES (CSIZE/(BSIZE*ASSOC)) IN CACHES */
#define MAXLINE 512

/* MAXIMUM ASSOCIATIVITY OF CACHES */
#define MAXASSOC 4

/* CACHE PARAMETERS */
typedef struct
{
    /* CACHE TYPE (0=UNIFIED, 1=SPLIT) */
    int type;

    /* CACHE SIZE FOR EACH SECTION (0=UNIFIED/INST, 1=DATA) */
    int csize[2];

    /* BLOCK (LINE) SIZE FOR EACH SECTION */
    int lsize[2];

    /* ASSOCIATIVITY FOR EACH SECTION */
    int assoc[2];

    /* BIT SHIFT USED TO ISOLATE TAG FROM ADDRESS */
    int tshift[2];

    /* BIT SHIFT USED TO ISOLATE LINE FROM ADDRESS */
    int lshift[2];

    /* BIT MASK USED TO ISOLATE LINE FROM ADDRESS */
    int lmask[2];
} param;

/* CACHE BLOCK STORAGE */
typedef struct
{
    /* BLOCK TAG */
    long tag;

    /* BLOCK 'USE BITS' FOR ASSOCIATIVE CACHES */
    unsigned long use;

    /* BLOCK OWNER PROCESS */
    int task;
} block;

/* CACHE PERFORMANCE STATISTICS */
typedef struct
{
    /* NUMBER OF INSTRUCTION FETCHES */
    unsigned long instcnt;

    /* NUMBER OF DATA LOADS */
    unsigned long readcnt;

    /* NUMBER OF DATA STORES */
    unsigned long writcnt;

    /* NUMBER OF OVERWRITES OVER EACH PROCESS */
    /* NUMTASKS+1 = INVALID DATA */
    unsigned long interfere[MAXTASKS+1];

    /* NUMBER OF INSTRUCTION FETCH MISSES */
    unsigned long instmisscnt;

    /* NUMBER OF DATA LOAD MISSES */
    unsigned long readmisscnt;

    /* NUMBER OF DATA STORE MISSES */
    unsigned long writmisscnt;
} stats;

/* STRING DEFINITION */
typedef char string[80];

/* SHARED ATOM DATA */
typedef struct
{
    /* NUMBER OF CACHES IN USE */
    int numcaches;

    /* NUMBER OF CACHES IN SIMULATION */
    int actcaches;

    /* NUMBER OF PROCESSES IN SIMULATION */
    int numtasks;

    /* NUMBER OF PROCESSES CURRENTLY EXECUTING */
    int count;

    /* PID OF CURRENT PROCESS */
    int curtask;

    /* PROCESS NAMES */
    string name[MAXTASKS];

    /* CACHE PARAMETERS */
    param para[MAXCACHES];

    /* CACHE STATE (BLOCK INFORMATION) */
    block data[MAXCACHES][2][MAXLINE][MAXASSOC];

    /* PERFORMANCE STATISTICS */
    stats stat[MAXCACHES][MAXTASKS];
} datablock;

/* INTEGER LOG2 FUNCTION */
int mylog2(int num)
{
    if (num < 2)
        return(0);
    else
        return(1 + mylog2(num/2));
}
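
As an illustration of how the tag shift, line shift, and line mask fields are used, the sketch below derives them from a cache size, line size, and associativity in the same way as the initialization routine of the program analysis file, and then splits an address into a line index and a tag as the reference routines do. The type `addr_params` and the functions `make_params`, `line_of`, and `tag_of` are hypothetical names introduced only for this example.

```c
#include <assert.h>

/* Integer log2, following the recursive definition in cache.h. */
static int ilog2(int num)
{
    return (num < 2) ? 0 : 1 + ilog2(num / 2);
}

typedef struct {
    int tshift;  /* shift that isolates the tag from an address  */
    int lshift;  /* shift that isolates the line from an address */
    int lmask;   /* mask that isolates the line from an address  */
} addr_params;

/* Derive the shifts and mask for one cache section. */
addr_params make_params(int csize, int lsize, int assoc)
{
    addr_params p;

    p.tshift = ilog2(csize / assoc);
    p.lshift = ilog2(lsize);
    p.lmask  = (csize / assoc) - 1;
    return p;
}

/* Line (set) index selected by an address. */
long line_of(long addr, addr_params p)
{
    return (addr & p.lmask) >> p.lshift;
}

/* Tag stored with the block for an address. */
long tag_of(long addr, addr_params p)
{
    return addr >> p.tshift;
}
```

For an 8192-byte, 2-way cache with 32-byte lines, each way spans 4096 bytes, so the tag shift is 12, the line shift is 5, and the mask is 4095; address 0x12345 then maps to line 26 with tag 18.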
A.4 Kernel Instrumentation File

The kernel instrumentation file kern.inst.c is responsible for adding the calls to the analysis
routines at the appropriate points. A call to the initialization function is made when the program
is initially loaded; thereafter, calls to the various analysis routines are made at each data
reference and at each set of instructions. A call is inserted at the start of the hardclock
interrupt service routine for scaling purposes. Note the test for the kernel procedures which
cannot be instrumented.
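
The grouping of instruction fetches into sets of at most eight instructions within a basic block can be sketched as follows. This is a standalone illustration that does not use the ATOM interface; it assumes fixed 4-byte instructions, and the functions `fetch_group_size` and `fetch_groups` are hypothetical names.

```c
#include <assert.h>

/* Number of instructions covered by the fetch call placed at byte address pc,
   for a basic block whose last instruction is at pc_end. */
int fetch_group_size(long pc, long pc_end)
{
    int remaining = (int)((pc_end - pc) / 4) + 1;

    return (remaining > 8) ? 8 : remaining;
}

/* Fill sizes[] with the group sizes for a block of n_inst instructions that
   starts at pc_start; returns the number of groups, i.e. the number of
   instruction-fetch calls that would be inserted for the block. */
int fetch_groups(long pc_start, int n_inst, int *sizes, int max_groups)
{
    long pc_end = pc_start + 4L * (n_inst - 1);
    int count;
    int groups = 0;

    for (count = 0; count < n_inst; count++) {
        /* A call is placed before every eighth instruction of the block. */
        if ((count & 7) == 0 && groups < max_groups)
            sizes[groups++] = fetch_group_size(pc_start + 4L * count, pc_end);
    }
    return groups;
}
```

A basic block of 19 instructions therefore receives three fetch calls, covering 8, 8, and 3 instructions.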

/* KERN.INST.C */
/* KERNEL INSTRUMENTATION FILE */
/* JOHN FRASER */

#include <string.h>
#include <cmplrs/atom.inst.h>

/* DEFINE PROCESS ID */
#define PROCNUM 0

/* TEST FOR ROUTINES WHICH CANNOT BE TRACED */
int CanInstrument(Proc *p)
{
    const char* name = ProcFileName(p);

    return(strcmp("../src/kernel/arch/alpha/locore.s",name) != 0 &&
           strcmp("../../../../src/kernel/arch/alpha/lockprim.s",name) != 0 &&
           strcmp("./src/kernel/arch/alpha/spl.s",name) != 0);
}


/* INSTRUMENT:                             */
/*   ALL DATA REFERENCES AND               */
/*   SETS OF 8 INSTRUCTIONS OR LESS        */
/*   (WITHIN SAME BASIC BLOCK)             */
/* ANALYSIS ROUTINES:                      */
/*   INSTRUCTION FETCH(ADDRESS,PID,NUMBER) */
/*   DATA LOAD(ADDRESS,PID)                */
/*   DATA STORE(ADDRESS,PID)               */

unsigned InstrumentAll(int argc, char** argv)
{
    Obj* o;
    Proc* p;
    Block* b;
    Inst* i;

    /* ADD PROCEDURE PROTOTYPES */
    AddCallProto("initcache()");
    AddCallProto("instref(REGV, int, int)");
    AddCallProto("readref(VALUE, int)");
    AddCallProto("writref(VALUE, int)");
    AddCallProto("skipcall(REGV, REGV)");

    /* ADD INITIALIZATION CALL */
    AddCallProgram(ProgramBefore, "initcache");

    /* ITERATE THROUGH ORIGINAL CODE ADDING REFERENCE CALLS */
    o = GetFirstObj();
    if (BuildObj(o)) return 1;
    p = GetNamedProc("hardclock");

    /* ADD CALL FOR HARDCLOCK SCALING */
    AddCallProc(p, ProcBefore, "skipcall", REG_SP, REG_RA);
    for (p=GetFirstObjProc(o); p!=NULL; p=GetNextProc(p))
    {
        if (CanInstrument(p))
        {
            for (b=GetFirstBlock(p); b!=NULL; b=GetNextBlock(b))
            {
                long pcEnd = InstPC(GetLastInst(b));
                int count = 0;

                for (i=GetFirstInst(b); i!=NULL; i=GetNextInst(i))
                {
                    /* INSTRUCTION FETCH */
                    if ((count & 7) == 0)
                    {
                        int instRem = ((pcEnd-InstPC(i))/4)+1;
                        int instrLine = (instRem > 8) ? 8 : instRem;

                        AddCallInst(i, InstBefore, "instref", REG_PC, PROCNUM, instrLine);
                    }
                    count++;

                    /* DATA LOAD */
                    if (IsInstType(i, InstTypeLoad))
                        AddCallInst(i, InstBefore, "readref", EffAddrValue, PROCNUM);

                    /* DATA STORE */
                    if (IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "writref", EffAddrValue, PROCNUM);
                }
            }
        }
    }

    WriteObj(o);
    return(0);
}
A.5 Kernel Analysis File

The kernel analysis file kern.anal.c defines the analysis routines called in the instrumentation
file, and any other utility functions/procedures. There are five analysis routines to consider:

Initialization  The initialization routine is responsible for establishing the basic simulation
parameters when the kernel is loaded. The simulator is essentially put into a paused simulation
state (0 caches) so that it is not actively capturing and processing references until a test
program is started.

Hardclock Scaling  This procedure discards a certain number of hardclock interrupts, controlled
by a scaling factor.

Instruction Fetch Routine  The instruction fetch routine is responsible for servicing instruction
fetches in the reference stream. It processes each set of references in the cache based on the
set's starting address, the number of instructions in the set, and the PID of the sending process.
Using a PID allows the same code to be used for each process's analysis routines as well as
maintaining cache coherency.

Data Load Routine  The data load routine is responsible for servicing the data loads in the
reference stream. It is almost identical to the previous routine except for the necessity of
determining which cache to access depending on a split or unified model, and the fact that it
services only a single reference at a time.

Data Store Routine  The analysis routine for data stores is almost identical to the data load
routine except that it increments different counters.
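
The hardclock scaling described in the list above amounts to a simple counter. The sketch below is a portable illustration only: in the kernel listing the discarded interrupts are skipped by returning directly to the caller of hardclock through inline assembly, which is not reproduced here. The type `clock_scaler` and the function `clock_tick` are hypothetical names.

```c
#include <assert.h>

typedef struct {
    int scale;  /* one interrupt in every `scale` is serviced       */
    int count;  /* interrupts seen since the last serviced interrupt */
} clock_scaler;

/* Returns 1 if this hardclock interrupt should be serviced,
   0 if it should be discarded. */
int clock_tick(clock_scaler *s)
{
    s->count++;
    if (s->count >= s->scale) {
        s->count = 0;
        return 1;
    }
    return 0;
}
```

With a scale of 3, two of every three interrupts are discarded; with a scale of 1, every interrupt is serviced, which matches the setting applied by the scaling emergency routine in the listing.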

The  similarities  between  each  routine  would  suggest  that  the  common  aspects  be  defined  in  a  separate 
function  which  is  called  by  each  analysis  routine,  but  this  increases  the  processing  latency  by  an 
unacceptable  degree.  The  data  used  by  these  routines  is  defined  in  the  library  file  and  is  implemented 
as  global  variables. 
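
The lookup and replacement logic shared by the three reference routines can be sketched on a single cache set. The example below follows the same steps as the listing (age every block, treat a matching tag and owner as a hit, otherwise prefer an invalid block and fall back to the oldest one), but it is a simplified illustration: `WAYS`, `INVALID_TASK`, the type `way`, and the function `access_set` are hypothetical names, and the associativity is fixed at two.

```c
#include <assert.h>

#define WAYS 2
#define INVALID_TASK 4   /* assumed marker: one past the last valid task id */

typedef struct {
    long tag;
    unsigned long use;   /* age since the last hit; larger means older */
    int task;            /* owning process, or INVALID_TASK if empty   */
} way;

/* Access one set. Returns 1 on a hit and 0 on a miss. On a miss,
   *victim_task receives the previous owner of the replaced block. */
int access_set(way set[WAYS], long tag, int task, int *victim_task)
{
    int x;
    int leastx = 0;
    int hit = 0;
    unsigned long leastused = 0;

    /* Age every block; a block with matching tag and owner is a hit. */
    for (x = 0; x < WAYS; x++) {
        set[x].use++;
        if (set[x].tag == tag && set[x].task == task) {
            set[x].use = 0;
            hit = 1;
        }
    }
    if (hit)
        return 1;

    /* Miss: take the first invalid block, otherwise the oldest block. */
    for (x = 0; x < WAYS; x++) {
        if (set[x].task == INVALID_TASK) {
            leastx = x;
            break;
        }
        if (set[x].use >= leastused) {
            leastused = set[x].use;
            leastx = x;
        }
    }

    /* Record whose block is overwritten, then install the new block. */
    *victim_task = set[leastx].task;
    set[leastx].tag = tag;
    set[leastx].use = 0;
    set[leastx].task = task;
    return 0;
}
```

Because the owner is compared along with the tag, a reference by another process to the same address misses and evicts a block, and the victim's task id is what the interference counters record.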

/* KERN.ANAL.C */
/* KERNEL ANALYSIS FILE */
/* JOHN FRASER */

/* HARDCLOCK SCALING VALUE */
#define SCALE 3

#include "cache.h"
#include <stdio.h>
#include <c_asm.h>

/* SHARED CACHE DATA */
datablock satom;

/* HARDCLOCK SCALING DATA */
int clockscale = 1;
int clockcount = 0;

/* INITIALIZE BASIC PARAMETERS */
/* SIMULATION (CAPTURE) DISABLED */
void initcache()
{
    satom.numcaches = 0;
    satom.actcaches = 0;
    satom.numtasks = 0;
    satom.curtask = 0;
    satom.count = 0;
    clockscale = SCALE;
    clockcount = 0;
    return;
}

/* HARDCLOCK SCALING */
void skipcall(unsigned long sp, unsigned long ra)
{
    clockcount++;
    if (clockcount >= clockscale)
    {
        clockcount = 0;
        return;
    }

    asm("mov %a0, %sp", sp);
    asm("mov %a1, %ra", ra);
    asm("ret %zero, (%ra)");
    return;
}

/* SCALING EMERGENCY */
void KernelPanic()
{
    clockscale = 1;
    return;
}

/* INSTRUCTION REFERENCE ROUTINE */
void instref(long addr, int proc, int count)
{
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = satom.numcaches;
    satom.numcaches = 0;

    /* PROCESS REFERENCES IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int assoc = (satom.para[cnum]).assoc[0];

        /* UPDATE STATISTICS */
        ((satom.stat[cnum][proc]).instcnt) += count;

        /* PARSE ADDRESS */
        aline = (addr & (satom.para[cnum]).lmask[0]) >>
                (satom.para[cnum]).lshift[0];
        atag = addr >> (satom.para[cnum]).tshift[0];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((satom.data[cnum][0][aline][x]).use)++;
            if (((satom.data[cnum][0][aline][x]).tag == atag) &&
                ((satom.data[cnum][0][aline][x]).task == proc))
            {
                (satom.data[cnum][0][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((satom.data[cnum][0][aline][x]).use >= leastused) ||
                    ((satom.data[cnum][0][aline][x]).task ==
                     satom.numtasks))
                {
                    leastused = (satom.data[cnum][0][aline][x]).use;
                    leastx = x;
                }
                if ((satom.data[cnum][0][aline][x]).task ==
                    satom.numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((satom.stat[cnum][proc]).instmisscnt)++;
            ((satom.stat[cnum][proc]).interfere[
                (satom.data[cnum][0][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (satom.data[cnum][0][aline][leastx]).tag = atag;
            (satom.data[cnum][0][aline][leastx]).use = 0;
            (satom.data[cnum][0][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    satom.numcaches = tempnumcaches;
    return;
}

/* DATA LOAD ROUTINE */
void readref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = satom.numcaches;
    satom.numcaches = 0;

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int type = (satom.para[cnum]).type;
        int assoc = (satom.para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((satom.stat[cnum][proc]).readcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (satom.para[cnum]).lmask[type]) >>
                (satom.para[cnum]).lshift[type];
        atag = addr >> (satom.para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((satom.data[cnum][type][aline][x]).use)++;
            if (((satom.data[cnum][type][aline][x]).tag == atag) &&
                ((satom.data[cnum][type][aline][x]).task == proc))
            {
                (satom.data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((satom.data[cnum][type][aline][x]).use >= leastused) ||
                    ((satom.data[cnum][type][aline][x]).task ==
                     satom.numtasks))
                {
                    leastused = (satom.data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((satom.data[cnum][type][aline][x]).task ==
                    satom.numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((satom.stat[cnum][proc]).readmisscnt)++;
            ((satom.stat[cnum][proc]).interfere[
                (satom.data[cnum][type][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (satom.data[cnum][type][aline][leastx]).tag = atag;
            (satom.data[cnum][type][aline][leastx]).use = 0;
            (satom.data[cnum][type][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    satom.numcaches = tempnumcaches;
    return;
}

/* DATA STORE ROUTINE */
void writref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = satom.numcaches;
    satom.numcaches = 0;

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int type = (satom.para[cnum]).type;
        int assoc = (satom.para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((satom.stat[cnum][proc]).writcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (satom.para[cnum]).lmask[type]) >>
                (satom.para[cnum]).lshift[type];
        atag = addr >> (satom.para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((satom.data[cnum][type][aline][x]).use)++;
            if (((satom.data[cnum][type][aline][x]).tag == atag) &&
                ((satom.data[cnum][type][aline][x]).task == proc))
            {
                (satom.data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((satom.data[cnum][type][aline][x]).use >= leastused) ||
                    ((satom.data[cnum][type][aline][x]).task == satom.numtasks))
                {
                    leastused = (satom.data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((satom.data[cnum][type][aline][x]).task == satom.numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((satom.stat[cnum][proc]).writmisscnt)++;
            ((satom.stat[cnum][proc]).interfere[
                (satom.data[cnum][type][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (satom.data[cnum][type][aline][leastx]).tag = atag;
            (satom.data[cnum][type][aline][leastx]).use = 0;
            (satom.data[cnum][type][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    satom.numcaches = tempnumcaches;
    return;
}
A.6 Program Instrumentation File

The program instrumentation file prog.inst.c is not substantially different from the kernel
version. The primary change is the removal of the test for specific procedures which cannot be
instrumented. The other alteration is the inclusion of a procedure at program end to write the
simulation's results to file. If multiple test programs are used, each uses a different
instrumentation file with a unique process identifier assigned in the #define statement.

/* PROG.INST.C */
/* PROGRAM INSTRUMENTATION FILE */
/* JOHN FRASER */

#include <string.h>
#include <cmplrs/atom.inst.h>

/* DEFINE PROCESS ID */
#define PROCNUM 1

/* INSTRUMENT:                             */
/*   ALL DATA REFERENCES AND               */
/*   SETS OF 8 INSTRUCTIONS OR LESS        */
/*   (WITHIN SAME BASIC BLOCK)             */
/* ANALYSIS ROUTINES:                      */
/*   INSTRUCTION FETCH(ADDRESS,PID,NUMBER) */
/*   DATA LOAD(ADDRESS,PID)                */
/*   DATA STORE(ADDRESS,PID)               */

unsigned InstrumentAll(int argc, char** argv)
{
    Obj* o;
    Proc* p;
    Block* b;
    Inst* i;

    /* ADD PROCEDURE PROTOTYPES */
    AddCallProto("initcache(int)");
    AddCallProto("instref(REGV, int, int)");
    AddCallProto("readref(VALUE, int)");
    AddCallProto("writref(VALUE, int)");
    AddCallProto("printres(int)");

    /* ADD INITIALIZATION CALL */
    AddCallProgram(ProgramBefore, "initcache", PROCNUM);

    /* ADD RESULTS OUTPUT CALL */
    AddCallProgram(ProgramAfter, "printres", PROCNUM);

    /* ITERATE THROUGH ORIGINAL CODE ADDING REFERENCE CALLS */
    o = GetFirstObj();
    if (BuildObj(o)) return 1;

    for (p=GetFirstObjProc(o); p!=NULL; p=GetNextProc(p))
    {
        for (b=GetFirstBlock(p); b!=NULL; b=GetNextBlock(b))
        {
            long pcEnd = InstPC(GetLastInst(b));
            int count = 0;

            for (i=GetFirstInst(b); i!=NULL; i=GetNextInst(i))
            {
                if ((count & 7) == 0)
                {
                    int instRem = ((pcEnd-InstPC(i))/4)+1;
                    int instrLine = (instRem > 8) ? 8 : instRem;

                    AddCallInst(i, InstBefore, "instref", REG_PC, PROCNUM, instrLine);
                }
                count++;

                if (IsInstType(i, InstTypeLoad))
                    AddCallInst(i, InstBefore, "readref", EffAddrValue, PROCNUM);
                if (IsInstType(i, InstTypeStore))
                    AddCallInst(i, InstBefore, "writref", EffAddrValue, PROCNUM);
            }
        }
    }

    WriteObj(o);
    return(0);
}
A.7 Program Analysis File

The program analysis file prog.anal.c is almost identical to the kernel version, except
for the initialization and conclusion routines. The reference processing routines perform the
same function; the other two are described below:

Initialization  The initialization routine is much more complex than its kernel equivalent.
First it must map the shared data into the program's address space via the /dev/mmap utility.
If the test program is the first to be executed for that simulation, it also reads the
simulation data from the input file, initializes the cache data, and enables the simulation.

Conclusion  The final routine is not present in the kernel because it is executed at program
completion. It is responsible for writing the simulation results to the output file.
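
Mapping the shared data requires page-aligned arithmetic, because a mapping must start on a page boundary while the shared structure generally does not. The sketch below shows only that arithmetic, under the assumption of an 8192-byte page; it performs no mapping and omits the conversion of the kernel address to a physical address. `PAGE_BYTES`, the type `map_window`, and the function `window_for` are hypothetical names.

```c
#include <assert.h>

#define PAGE_BYTES 8192UL   /* assumed page size, for illustration only */

typedef struct {
    unsigned long base;    /* page-aligned start of the mapping           */
    unsigned long offset;  /* offset of the object within its first page  */
    unsigned long length;  /* mapping length, rounded up to whole pages   */
} map_window;

/* Round an address down to the start of its page. */
static unsigned long trunc_page(unsigned long a)
{
    return a & ~(PAGE_BYTES - 1);
}

/* Round a byte count up to a whole number of pages. */
static unsigned long round_page(unsigned long a)
{
    return (a + PAGE_BYTES - 1) & ~(PAGE_BYTES - 1);
}

/* Compute the window that must be mapped to reach `size` bytes at `addr`. */
map_window window_for(unsigned long addr, unsigned long size)
{
    map_window w;

    w.base   = trunc_page(addr);
    w.offset = addr & (PAGE_BYTES - 1);
    w.length = round_page(w.offset + size);
    return w;
}
```

After mapping `length` bytes starting at `base`, the object is reached by adding `offset` to the returned mapping address; the listing combines the two with a bitwise OR, which gives the same result because the mapping address is page aligned.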

/* PROG.ANAL.C */
/* PROGRAM ANALYSIS FILE */
/* JOHN FRASER */

#include <stdio.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/errno.h>
#include <fcntl.h>
#include <mach/machine/vm_param.h>
#include "cache.h"

/* /DEV/MMAP DEFINITIONS */
#define k2phys(addr) (((long)(addr)) & 0xffffffff)
#define SM_MODE (MAP_FILE|MAP_VARIABLE|MAP_SHARED)
#define SM_PROT (PROT_READ|PROT_WRITE)

/* SHARED CACHE DATA POINTER */
datablock* psatom;

/* ADDRESS MAPPING FUNCTIONS */
void FatalError(char* string)
{
    fprintf(stderr, "ucache: %s\n", string);
    exit(1);
}

long GetAddress(char* vmunixDebug, char* symbol)
{
    long addr;
    char command[200];
    int fields;
    FILE* file;

    sprintf(command, "nm -B %s | grep ' %s$'", vmunixDebug, symbol);
    file = popen(command, "r");
    if (file==NULL)
    {
        fprintf(stderr,"Open failed: %s\n", command);
        exit(1);
    }

    fields = fscanf(file,"0x%lx",&addr);
    if (fields!=1) FatalError("Get address failed");
    pclose(file);
    return addr;
}

/* INITIALIZATION ROUTINE */
void initcache(int proc)
{
    /* GET POINTER TO SHARED DATA IN KERNEL */
    caddr_t sm_addr;
    size_t length;
    off_t sm_physbase, sm_pgoff;
    unsigned long kbase = GetAddress("vmunix.debug", "satom");
    int fd = open("/dev/mem", O_RDWR, 0);

    if (fd<0) FatalError("Unable to open /dev/mem\n");
    sm_physbase = k2phys(alpha_trunc_page(kbase));
    sm_pgoff = kbase & (ALPHA_PGBYTES-1);
    length = alpha_round_page(sm_pgoff + sizeof(datablock));
    sm_addr = mmap(NULL, length, SM_PROT, SM_MODE, fd, sm_physbase);
    if (sm_addr == (caddr_t)-1) FatalError("mmap failed\n");
    psatom = (datablock*) ((long)sm_addr | (long)sm_pgoff);

    /* INCREMENT PROCESS COUNTER */
    psatom->count++;

    /* IF FIRST PROCESS, INITIALIZE CACHE DATA */
    if (proc == 1)
    {
        int tempnumcaches, tempnumtasks;
        int x,a,b,c,d;
        FILE *input, *output;

        /* LOAD BASIC CHARACTERISTICS FROM FILE */
        input = fopen("cache.in","r");
        fgets(psatom->name[0], 79, input);
        fscanf(input, "%d\n", &tempnumtasks);
        for (x=1; x<tempnumtasks; x++)
            fgets(psatom->name[x], 79, input);
        fscanf(input, "%d\n", &tempnumcaches);
        for (x=0; x<tempnumcaches; x++)
        {
            fscanf(input, "%d\n", &(psatom->para[x]).type);
            if ((psatom->para[x]).type == 0)
                fscanf(input, "%d %d %d\n", &(psatom->para[x]).csize[0],
                       &(psatom->para[x]).lsize[0],
                       &(psatom->para[x]).assoc[0]);
            else
                fscanf(input, "%d %d %d %d %d %d\n", &(psatom->para[x]).csize[0],
                       &(psatom->para[x]).lsize[0],
                       &(psatom->para[x]).assoc[0],
                       &(psatom->para[x]).csize[1],
                       &(psatom->para[x]).lsize[1],
                       &(psatom->para[x]).assoc[1]);
        }

        /* SET ADDRESS HASHING PARAMETERS */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<((psatom->para[a]).type + 1); b++)
            {
                (psatom->para[a]).tshift[b] = mylog2((psatom->para[a]).csize[b]/
                                                     (psatom->para[a]).assoc[b]);
                (psatom->para[a]).lshift[b] = mylog2((psatom->para[a]).lsize[b]);
                (psatom->para[a]).lmask[b] = ((psatom->para[a]).csize[b]/
                                              (psatom->para[a]).assoc[b])-1;
            }

        /* INITIALIZE CACHE STORAGE */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<((psatom->para[a]).type + 1); b++)
                for (c=0; c<((psatom->para[a]).csize[b]/
                             ((psatom->para[a]).lsize[b] *
                              (psatom->para[a]).assoc[b])); c++)
                    for (d=0; d<(psatom->para[a]).assoc[b]; d++)
                    {
                        (psatom->data[a][b][c][d]).use = 0;
                        (psatom->data[a][b][c][d]).task = tempnumtasks;
                    }

        /* INITIALIZE CACHE STATISTICS */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<tempnumtasks; b++)
            {
                (psatom->stat[a][b]).instcnt = 0;
                (psatom->stat[a][b]).readcnt = 0;
                (psatom->stat[a][b]).writcnt = 0;
                (psatom->stat[a][b]).instmisscnt = 0;
                (psatom->stat[a][b]).readmisscnt = 0;
                (psatom->stat[a][b]).writmisscnt = 0;
                for (c=0; c<=tempnumtasks; c++)
                    (psatom->stat[a][b]).interfere[c] = 0;
            }

        /* LOG SIMULATION DATA TO OUTPUT FILE */
        output = fopen("cache.out","w");
        fprintf(output, "\n\n\n\n\n\n\n\n");
        fprintf(output,"<><><><><><><><><><><><><><><><><>\n");
        fprintf(output, "SIMULATION: %s", psatom->name[0]);
        fprintf(output,"<><><><><><><><><><><><><><><><><>\n");
        fprintf(output, "\n\n\n\n");
        fprintf(output,"Number Tasks = %d\n\n", tempnumtasks);
        fprintf(output," #0: kernel\n\n");
        for (x=1; x<tempnumtasks; x++)
            fprintf(output," #%d: %s\n", x, psatom->name[x]);
        fprintf(output, "\n\n\n\n");
        fprintf(output, "Number Caches = %d\n", tempnumcaches);
        fprintf(output," (type, icsize, ilsize, iassoc, dcsize, dlsize, dassoc)\n\n");
        for (x=0; x<tempnumcaches; x++)
        {
            fprintf(output," #%d: %1d %7d %5d %3d", x,
                    (psatom->para[x]).type,
                    (psatom->para[x]).csize[0],
                    (psatom->para[x]).lsize[0],
                    (psatom->para[x]).assoc[0]);
            if ((psatom->para[x]).type == 1)
                fprintf(output," %7d %5d %3d", (psatom->para[x]).csize[1],
                        (psatom->para[x]).lsize[1],
                        (psatom->para[x]).assoc[1]);
            fprintf(output,"\n\n");
        }
        fprintf(output,"\f");
        fclose(output);

        /* START CAPTURE & SIMULATION */
        psatom->numtasks = tempnumtasks;
        psatom->numcaches = tempnumcaches;
        psatom->actcaches = tempnumcaches;
        psatom->curtask = -1;
    }

    return;
}


/* INSTRUCTION REFERENCE ROUTINE */
void instref(long addr, int proc, int count)
{
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = psatom->numcaches;
    psatom->numcaches = 0;

    /* RE-ESTABLISH AFTER CONTEXT SWITCH (RE-ENTRANCE) */
    if (psatom->curtask != proc)
    {
        tempnumcaches = psatom->actcaches;
        psatom->curtask = proc;
    }

    /* PROCESS REFERENCES IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int assoc = (psatom->para[cnum]).assoc[0];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).instcnt) += count;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[0]) >>
                (psatom->para[cnum]).lshift[0];
        atag = addr >> (psatom->para[cnum]).tshift[0];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][0][aline][x]).use)++;
            if (((psatom->data[cnum][0][aline][x]).tag == atag) &&
                ((psatom->data[cnum][0][aline][x]).task == proc))
            {
                (psatom->data[cnum][0][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][0][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][0][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][0][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][0][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).instmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][0][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (psatom->data[cnum][0][aline][leastx]).tag = atag;
            (psatom->data[cnum][0][aline][leastx]).use = 0;
            (psatom->data[cnum][0][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    psatom->numcaches = tempnumcaches;
    return;
}


/* DATA LOAD ROUTINE */
void readref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = psatom->numcaches;
    psatom->numcaches = 0;

    /* RE-ESTABLISH AFTER CONTEXT SWITCH (RE-ENTRANCE) */
    if (psatom->curtask != proc)
    {
        tempnumcaches = psatom->actcaches;
        psatom->curtask = proc;
    }

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int type = (psatom->para[cnum]).type;
        int assoc = (psatom->para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).readcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[type]) >>
                (psatom->para[cnum]).lshift[type];
        atag = addr >> (psatom->para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][type][aline][x]).use)++;
            if (((psatom->data[cnum][type][aline][x]).tag == atag) &&
                ((psatom->data[cnum][type][aline][x]).task == proc))
            {
                (psatom->data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][type][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][type][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][type][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).readmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][type][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (psatom->data[cnum][type][aline][leastx]).tag = atag;
            (psatom->data[cnum][type][aline][leastx]).use = 0;
            (psatom->data[cnum][type][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    psatom->numcaches = tempnumcaches;
    return;
}

/* DATA STORE ROUTINE */
void writref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PAUSE CAPTURE (RE-ENTRANCE) */
    int tempnumcaches = psatom->numcaches;
    psatom->numcaches = 0;

    /* RE-ESTABLISH AFTER CONTEXT SWITCH (RE-ENTRANCE) */
    if (psatom->curtask != proc)
    {
        tempnumcaches = psatom->actcaches;
        psatom->curtask = proc;
    }

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<tempnumcaches; cnum++)
    {
        int type = (psatom->para[cnum]).type;
        int assoc = (psatom->para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).writcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[type]) >>
                (psatom->para[cnum]).lshift[type];
        atag = addr >> (psatom->para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][type][aline][x]).use)++;
            if (((psatom->data[cnum][type][aline][x]).tag == atag) &&
                ((psatom->data[cnum][type][aline][x]).task == proc))
            {
                (psatom->data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][type][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][type][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][type][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).writmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][type][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (psatom->data[cnum][type][aline][leastx]).tag = atag;
            (psatom->data[cnum][type][aline][leastx]).use = 0;
            (psatom->data[cnum][type][aline][leastx]).task = proc;
        }
    }

    /* RESUME CAPTURE */
    psatom->numcaches = tempnumcaches;
    return;
}

/* STORE RESULTS ROUTINE */
void printres(int proc)
{
    int c,x,y;
    stats total;
    FILE* file;

    /* PAUSE CAPTURE */
    int tempnumcaches = psatom->actcaches;
    psatom->numcaches = 0;

    /* OPEN FILE FOR OUTPUT */
    file = fopen("cache.out","a");
    fprintf(file,"DATA AT END OF PROCESS %d\n",proc);
    fprintf(file,"<><><><><><><><><><><><><><><><><><><><><><><><><>\n");

    /* PRINT DATA FOR EACH CACHE */
    for (c=0; c<tempnumcaches; c++)
    {
        fprintf(file, "simulation: %s (data at end of process %d)\n",
                psatom->name[0],proc);

fprintf  (f ile," - W')  ; 

f  printf  (file,  "CACHE  #  ‘/.dV,  c)  ; 

f  printf  (file,  "cache  type:  y.d  (0=iinified,  l=split)\n", 

(psatom“>para[c3 ) .type) ; 

f  printf  (file,  "icache  size:  y.dXn",  (psatom“>para[c])  .csize[0]  )  ; 
f  printf  (file,  "icache  line  size:  y,d\n" ,  (psatom->para[c]  )  .IsizeCO]  )  ; 
f  printf  (file,  "icache  associativity:  y.d\n", 

(psatom->paxa[c] ) . assoc [0] ) ; 

if  ((psatom->paraCc3)  .t3rpe  ==  1) 

f  printf  (file,  "dcache  size:  y.d\n" ,  (psatom->para[c3  )  .csizeCl3  )  ; 
f  printf  (file,  "dcache  line  size:  y,d\n" ,  (psatom“>para[c3  )  .Isize  Cl3  )  ; 
fprintf (file, "dcache  associativity:  y.d\n", 

(psatom->para[c3 ) .assoc [13 ) ; 

} 

total. instcnt  =  0; 
total.readcnt  =  0; 
total. writ cnt  =  0; 
t otal. ins tmis sent  =  0; 
total .readmiss cnt  =  0; 
total . writ mis sent  =  0; 

/♦  PRINT  PROCESS  CACHE  PERFORMANCE  *./ 
for  (y=0;  y  <  psatom-‘>numtasks;  y++) 

-C 

int  z; 

total. instcnt  =  total . instcnt  +  (psat om-“>st at Cc3  Cy3 ). instcnt ; 
total . readent  =  total . readent  +  (psatom->stat [c3  Cy3 ) . readent ; 
total. writent  =  total. writent  +  (psatom->stat [c3 Cy3 ) .writ cnt ; 
total . ins tmis sent  =  total. instmis sent  + 

(psatom->stat [c3  Cy3 ) . instmis sent ; 
total. readmiss cnt  =  total. readmiss cnt  + 

(psatom->stat  Cc3  Cy3 ) . readmissent ; 
total.writmisscnt  =  total .writmis sent  + 

(psatom->stat  Cc3  [y3 ) . writmis sent ; 
fprintf (file,"  +****+*+**\n") ; 

fprintf  (file,"  Process  #y,d\n",  y)  ; 

fprintf  (file,"  Inst  y,121u  ",  (psatom->stat[c3  Cy3  ).  instcnt)  ; 

fprintf  (file,  "Miss  y,121u  ",  (psat  om->st  at  Cc3  Cy3  )  .instmis  sent)  ; 
if  (  (psatom->stat  [c3  Cy3  )  .  instcnt  !  =  0) 
fprintf  (file,  "Perc  5(.61f",  100.0  * 

(psatom->stat  Cc3  Cy3 ) . instmis sent  / 

(psat om->s tat  Cc3  [y3 ) . instcnt) ; 

fprintf  (file,  "\n  Data  y,121u  ",  (psatom">stat  [c3  [y3  )  .readent  + 

(psatom->stat [c3  Cy3 ) .writent) ; 

fprintf  (file,  "Miss  •/•121u  ",  (psat  om->st  at  [c3  [y3  )  .readmissent  + 
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(psatom~>stat  [c]  [y]  )  .writ  mis  sent)  ; 

if  ( ((psatom->stat [c] [y] ) .readcnt+(psatom->stat  [c]  [y] ) .writent)  !=  0) 
fprintf (file/'Perc  •/•.61f",  100.0  ♦ 

(  (psatom->stat  [c]  [y]  )  .  readmissent  + 
(psatom~>stat  [c]  [y] ) . writmissent)  / 

( (psatom“>stat [c]  [y] ) . readent  + 
(psatom->stat [c] Cy] ) . writent) ) ; 
fprintf (f ile, “Xn  read  y,12lTi 

(psatom->stat [e]  [y] ) .readent) ; 

fprintf  (file, "Miss  y,121n  ",  (psat om“>st at  [e]  [y] )  .readmissent) ; 
if  ((psatom->stat[e] [y] ) .readent  !=  0) 
fprintf (file,"Pere  %.61f",  100.0  * 

(psatom->stat [e] Cy] ) .readmissent  / 
(psatom->stat [c] [y] ) .readent) ; 

fprintf  (file,  "\n  writ  y,121u  ",  (ps  at  om-->st  at  [e]  [y]  )  .writent) ; 

fprintf  (file,  "Miss  y.l21u  ",  (psatom->stat  [c]  [y]  )  .  writmissent) ; 
if  ((psatom-’>stat[e]  Cy]  )  .writent  !-  0) 
fprintf  (file,  "Pere  >(.61f",  100.0  * 

(psatom->stat Cc] Cy] ) .writmissent  / 
(psatom->stat  Cc] Cy] ) .writent) ; 

fprintf (file," \n  TOTAL  %121ii  ",  (ps atom->st at Cc] Cy] ). instent  + 

(psatom'->stat  Cc]  Cy]  )  .readent  + 
(psatom->stat  Cc]  Cy]  )  .writent)  ; 

fprintf  (file,  "Miss  iCl21u  ",  (psat  om->st  at  Cc]  Cy]  )•  instmis  sent  + 

(psatom->stat  Cc]  Cy]  )  .readmissent  + 
(psatom“>stat  Ce]  Cy]  )  .  writmissent)  ; 
if  ( ( (psatom->stat  Cc] Cy] ) . instent  + 

(psatom->stat  Cc] Cy] ) .readent  + 

(psatom->stat Cc] Cy] ) .writent)  !=  0) 
fprintf  (file,  "Pere  y,.61f",  100.0  ♦ 

(  (psatom->stat  Cc]  Cy]  ) .  instmis  sent  + 
(psatom->stat  Cc]  Cy]  )  .  readmissent  + 
(psatom“>stat  Cc]  Cy]  )  .writmissent)  / 

(  (psatom”>stat  Cc]  Cy]  )  .  ins  tent  + 
(psatom->stat  Cc]  Cy]  )  .readent  + 
(psatom“>stat  Cc]  Cy]  )  .writent)) ; 

fprintf (file, "\n  Int  (times  pro e ess  %d  overwrote: )\n",  y)  ; 

for  (z=0;  z  <=  psatom->mimtasks;  z++) 

fprintf (file, "  Proeess  %d  =  Xl21u\n",  z, 

(psatom->stat [e] Cy] ) . int erf ere Cz]  ) ; 
fprintf  (file, "  (proeess  y,d  is  invalid  data)\n", 

psatom“>nnmtasks) ; 


} 


/*  PRINT  TOTAL  CACHE  PERFORMANCE  */ 

fprintf  (file,"  +  +  +  +  +  +  • 

fprintf (file,"  TOTAL  FOR  CACHEXn"); 

fprintf (file,"  Inst  yi21u  ",  total . instent) ; 

fprintf (file, "Miss  %121u  ",  total. instmis sent) ; 

if  (total. instent  !=  0) 

fprintf (file, "Pere  %.61f",  100.0  *  total . instmis sent  / 

total . instent) ; 
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fprintf (file/‘\n  Data  %121u  ",  total.readcnt  + 

total. writ cut) ; 

fprintf (file, "Miss  %121u  ",  tot al.readmis sent  +  total .wr itmis sent) ; 
if  ((total.readcnt  +  total .writ ent)  !=  0) 
fprintf  (file,  "Perc  •/•.61f",  100.0  * 

(total. readmissent  +  total .writmis sent)/ 
(total.readcnt  +  total.writcnt) ) ; 
fprintf  (file,  "\n  read  y,121u  ",  total.readcnt); 

fprintf  (file,  "Miss  •/•121u  ",  total.readmisscnt)  ; 
if  (total.readcnt  !=  0) 

fprintf  (file,  "Perc  '/,.61f",  100.0  ♦  total.readmisscnt  / 

total . readent ) ; 

fprintf  (file,  "\n  writ  •/,121u  ",  total.writcnt); 

fprintf  (file,  "Miss  y,121n  ",  total.writmisscnt)  ; 
if  (total.writcnt  !=  0) 

fprintf  (file,  "Perc  y,.61f",  100.0  *  total.writmisscnt  / 

total.writcnt) ; 

fprintf  (file,  "\n  TOTAL  y,121u  ",  total,  instent  + 

total.readcnt  + 
total . writent ) ; 

fprintf  (file,  "Miss  y,121u  ",  total,  ins  t  mis  sent  + 

total.readmisscnt  + 
total.writmisscnt) ; 

if  ((total. instent  +  total.readcnt  +  total.writcnt)  !=  0) 
fprintf  (file,  "Perc  y..61f",  100.0  * 

(total,  instmissent  + 
total.readmisscnt  + 
total.writmisscnt)  / 

(total . instent  + 
total.readcnt  + 
total.writcnt)) ; 

fprintf (file, "\n") ; 
fprintf (file, "\f") ; 

} 

f close(f ile) ; 

/*  IF  LAST  PROCESS,  SHUT  DOWN  SIMULATION  */ 
psatom->coxint — ; 
if  (psatom->coiint  >  0) 

psatom->mimcaclLes  =  tempniimcaclies; 
psatom->ciirtask  =  proc; 

} 

return; 

} 
A.8  Sample Tool Description File

To create an ATOM tool, a tool description file must be created which defines the various
tool characteristics, such as the files to incorporate and the control flags to use. An example is
shown below: kexe.desc, the tool description used to create the executable version of the kernel.
For more information, please refer to the ATOM source documents.

INST_FILE     kern.inst.c
ANAL_FILE     kern.anal.c
ANAL_LDFLAGS  -non_shared
ATOM_REQ      -Xkernel -Xgprog
ATOM_DEF      -o vmunix.cache

Another tool example is the one used for the context switch model, mod.desc, which shows
the -lm flag required to use functions from the libm.a library.

INST_FILE     prog.inst.c
ANAL_FILE     model.anal.c
ANAL_LDFLAGS  -lm
A.9  Model Library

The following file, model.h, was used as a procedure library for the context switch model
implementation. It is used in conjunction with the cache model library.

/* MODEL.H */
/* CONTEXT SWITCH MODEL LIBRARY */
/* JOHN FRASER */

#include <stdlib.h>
#include <math.h>

/* COMPUTE RANDOM EXECUTION INTERVAL */
long compint()
{
    long temp = random();

    temp = (long)trunc(-50000.0*log(1.0-(random()/(pow(2.0,31.0)-1.0))));

    /* INTERVAL CAP */
    if (temp > 250000)
        return(250000);
    else
        return(temp);
}

/* COMPUTE FACTORIAL FUNCTION */
double myfact(long x)
{
    if (x == 0)
        return(1.0);
    else
        return((double)x * myfact(x-1));
}

/* COMPUTE COMBINATORIAL FUNCTION */
double mycomb(long F, long i)
{
    long x;
    double temp3 = 1.0/myfact(i);

    /* CAN'T USE STANDARD FACTORIAL EXPRESSION => OVERFLOW ERROR */
    for (x=F; x>F-i; x--)
        temp3 = temp3 * x;
    return(temp3);
}

/* COMPUTE BLOCK OVERWRITE PROBABILITY */
double calcprob(long F, int C, int B, int A, int i)
{
    int x;
    double temp2 = 0.0;
    int N = C/(B*A);

    if (i < A)
    {
        double a,b,c;

        a = (double)(mycomb(F,i));
        b = (double)(pow((1.0/(double)N),(double)i));

        /* UNDERFLOW TEST FOR LAST TERM */
        if ((F-i)*log(1.0-(1.0/(double)N)) < -600.0)
            c = 0;
        else
            c = (double)pow((1.0-(1.0/(double)N)),((double)(F-i)));
        return(a*b*c);
    }
    else
        for (x=0; x < A; x++)
            temp2 = temp2 + ((double)(mycomb(F,x)) *
                             (pow((1.0/N),x)) *
                             (pow((1.0-(1.0/N)),(F-x))));
    return(1.0 - temp2);
}

/* COMPUTE INSTRUCTION FOOTPRINT */
long ifoot(long R, int B)
{
    return((long)trunc(R/(50.0*B)));
}

/* COMPUTE DATA FOOTPRINT */
long dfoot(long R)
{
    return((long)trunc(R/50.0));
}
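
The block overwrite probability in calcprob is a binomial term: with F footprint blocks spread
uniformly over N = C/(B*A) sets, the chance that exactly i of them land in a given set is
C(F,i) (1/N)^i (1 - 1/N)^(F-i). The sketch below is an alternative way to evaluate the same
quantity, not the method used in model.h: it replaces the factorial and pow() calls with a
term-by-term recurrence. The names binom_pmf and binom_tail are hypothetical, and N >= 2 is
assumed.

```c
#include <assert.h>

/* Hypothetical helper: probability that exactly i of F uniformly placed
 * blocks fall in one given set out of N sets (assumes N >= 2). */
double binom_pmf(long F, long N, long i)
{
    assert(F >= 0 && N >= 2 && i >= 0);
    if (i > F)
        return 0.0;

    double p = 1.0 / (double)N;
    double q = 1.0 - p;
    double term = 1.0;
    long k;

    /* Start from the i = 0 term, q^F. */
    for (k = 0; k < F; k++)
        term *= q;

    /* Recurrence: pmf(k+1) = pmf(k) * (F-k)/(k+1) * p/q. */
    for (k = 0; k < i; k++)
        term *= ((double)(F - k) / (double)(k + 1)) * (p / q);

    return term;
}

/* Probability that at least A of the F blocks fall in the set:
 * 1 minus the sum of the terms for 0 .. A-1 blocks. */
double binom_tail(long F, long N, long A)
{
    double below = 0.0;
    long x;

    for (x = 0; x < A; x++)
        below += binom_pmf(F, N, x);
    return 1.0 - below;
}
```

For example, binom_pmf(2, 2, 1) evaluates to 0.5 and binom_tail(2, 2, 1) to 0.75. For very large
F the starting term q^F underflows to zero, so every term becomes zero; this is the same
situation the underflow test in calcprob handles by setting the last factor to zero.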
A.10  Model Analysis File

The files used to test the context switch model were very similar to those used in the first set
of simulations. The program instrumentation file was identical, and the analysis file model.anal.c
was generally the same, although with the addition of the model code as shown. Since the model
was tested with a single process trace, the re-entrance mechanisms were not required.

/* MODEL.ANAL.C */
/* PROGRAM ANALYSIS FILE */
/* W/ CONTEXT SWITCH MODEL */
/* JOHN FRASER */

#include <stdio.h>
#include "cache.h"
#include "model.h"

/* CACHE DATA */
datablock satom;
datablock* psatom;

/* MODEL DATA */
unsigned long switchnext;
unsigned long switchcnt;
unsigned long switchrec;

/* INITIALIZATION ROUTINE */
void initcache(int proc)
{
    /* SET POINTER TO CACHE DATA */
    psatom = &satom;

    /* INITIALIZE BASIC DATA */
    psatom->count = 0;
    psatom->numcaches = 0;
    psatom->numtasks = 0;

    /* INITIALIZE SWITCH MODEL */
    switchcnt = 0;
    switchrec = 0;
    switchnext = compint();

    /* IF FIRST PROCESS, INITIALIZE CACHE DATA */
    psatom->count++;
    if (psatom->count == 1)
    {
        int tempnumcaches,tempnumtasks;
        int x,a,b,c,d;
        FILE *input, *output;

        /* LOAD BASIC CHARACTERISTICS FROM FILE */
        input = fopen("cache.in","r");
        fgets(psatom->name[0], 79, input);
        fscanf(input,"%d\n",&tempnumtasks);
        for (x=1; x<tempnumtasks; x++)
            fgets(psatom->name[x], 79, input);
        fscanf(input,"%d\n",&tempnumcaches);
        for (x=0; x<tempnumcaches; x++)
        {
            fscanf(input,"%d\n", &(psatom->para[x]).type);
            if ((psatom->para[x]).type == 0)
                fscanf(input,"%d %d %d\n", &(psatom->para[x]).csize[0],
                       &(psatom->para[x]).bsize[0],
                       &(psatom->para[x]).assoc[0]);
            else
                fscanf(input,"%d %d %d %d %d %d\n", &(psatom->para[x]).csize[0],
                       &(psatom->para[x]).bsize[0],
                       &(psatom->para[x]).assoc[0],
                       &(psatom->para[x]).csize[1],
                       &(psatom->para[x]).bsize[1],
                       &(psatom->para[x]).assoc[1]);
        }

        /* SET ADDRESS HASHING PARAMETERS */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<((psatom->para[a]).type + 1); b++)
            {
                (psatom->para[a]).tshift[b] = mylog2((psatom->para[a]).csize[b] /
                                                     (psatom->para[a]).assoc[b]);
                (psatom->para[a]).lshift[b] = mylog2((psatom->para[a]).bsize[b]);
                (psatom->para[a]).lmask[b] = ((psatom->para[a]).csize[b] /
                                              (psatom->para[a]).assoc[b])-1;
            }

        /* INITIALIZE CACHE STORAGE */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<((psatom->para[a]).type + 1); b++)
                for (c=0; c<((psatom->para[a]).csize[b] /
                             ((psatom->para[a]).bsize[b] *
                              (psatom->para[a]).assoc[b])); c++)
                    for (d=0; d<(psatom->para[a]).assoc[b]; d++)
                    {
                        (psatom->data[a][b][c][d]).use = 0;
                        (psatom->data[a][b][c][d]).task = tempnumtasks;
                    }

        /* INITIALIZE CACHE STATISTICS */
        for (a=0; a<tempnumcaches; a++)
            for (b=0; b<tempnumtasks; b++)
            {
                (psatom->stat[a][b]).instcnt = 0;
                (psatom->stat[a][b]).readcnt = 0;
                (psatom->stat[a][b]).writcnt = 0;
                (psatom->stat[a][b]).instmisscnt = 0;
                (psatom->stat[a][b]).readmisscnt = 0;
                (psatom->stat[a][b]).writmisscnt = 0;
                for (c=0; c <= tempnumtasks; c++)
                    (psatom->stat[a][b]).interfere[c] = 0;
            }

        /* LOG SIMULATION DATA TO OUTPUT FILE */
        output = fopen("cache.out","w");
        fprintf(output,"\n\n\n\n\n\n\n\n");
        fprintf(output,"<><><><><><><><><><><><><><><><><><><><><><>\n");
        fprintf(output,"SIMULATION (single): %s",psatom->name[0]);
        fprintf(output,"<><><><><><><><><><><><><><><><><><><><><><>\n");
        fprintf(output,"\n\n\n\n");
        fprintf(output,"Number Tasks = %d\n\n",tempnumtasks);
        for (x=1; x<tempnumtasks; x++)
            fprintf(output,"  #%d: %s\n",x,psatom->name[x]);
        fprintf(output,"\n\n\n\n");
        fprintf(output,"Number Caches = %d\n",tempnumcaches);
        fprintf(output,"  (type, icsize, ibsize, iassoc, "
                       "dcsize, dbsize, dassoc)\n\n");
        for (x=0; x<tempnumcaches; x++)
        {
            fprintf(output,"  #%d: %1d %7d %5d %3d",x,
                    (psatom->para[x]).type,
                    (psatom->para[x]).csize[0],
                    (psatom->para[x]).bsize[0],
                    (psatom->para[x]).assoc[0]);
            if ((psatom->para[x]).type == 1)
                fprintf(output," %7d %5d %3d", (psatom->para[x]).csize[1],
                        (psatom->para[x]).bsize[1],
                        (psatom->para[x]).assoc[1]);
            fprintf(output,"\n\n");
        }
        fprintf(output,"\f");
        fclose(output);

        /* START SIMULATION */
        psatom->numtasks = tempnumtasks;
        psatom->numcaches = tempnumcaches;
    }

    return;
}


/* INSTRUCTION REFERENCE ROUTINE */
void instref(long addr, int proc, int count)
{
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PROCESS REFERENCES IN EACH CACHE */
    for (cnum=0; cnum < psatom->numcaches; cnum++)
    {
        int assoc = (psatom->para[cnum]).assoc[0];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).instcnt) += count;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[0]) >>
                (psatom->para[cnum]).lshift[0];
        atag = addr >> (psatom->para[cnum]).tshift[0];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][0][aline][x]).use)++;
            if (((psatom->data[cnum][0][aline][x]).tag == atag) &&
                ((psatom->data[cnum][0][aline][x]).task == proc))
            {
                (psatom->data[cnum][0][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][0][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][0][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][0][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][0][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).instmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][0][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (psatom->data[cnum][0][aline][leastx]).tag = atag;
            (psatom->data[cnum][0][aline][leastx]).use = 0;
            (psatom->data[cnum][0][aline][leastx]).task = proc;
        }
    }

    /* INCREMENT SWITCH COUNTER */
    switchcnt += count;

    /* CHECK FOR CONTEXT SWITCH AND PERFORM */
    if (switchcnt >= switchnext)
    {
        unsigned long intercnt;
        long foot;
        int sec;
        double prob,prbcnt;

        /* COMPUTE INTERRUPTION INTERVAL */
        intercnt = (psatom->numtasks-1) * compint();

        /* APPLY IMPACT TO EACH CACHE */
        for (cnum=0; cnum < psatom->numcaches; cnum++)
        {
            /* APPLY IMPACT TO EACH SECTION (INST/DATA) */
            for (sec=0; sec<=(psatom->para[cnum]).type; sec++)
            {
                /* COMPUTE FOOTPRINT FOR EACH SECTION */
                if (sec==0)
                {
                    foot = ifoot(intercnt, ((psatom->para[cnum]).bsize[sec] / 4));
                    if ((psatom->para[cnum]).type == 0)
                        foot = foot + dfoot(intercnt);
                }
                else
                    foot = dfoot(intercnt);

                /* ITERATE THROUGH EACH LINE OVERWRITING RANDOM BLOCK(S) */
                for (aline=0; aline < (psatom->para[cnum]).csize[sec] /
                                      ((psatom->para[cnum]).bsize[sec] *
                                       (psatom->para[cnum]).assoc[sec]); aline++)
                {
                    /* GENERATE LINE'S PROBABILITY */
                    prob = (double)random()/(pow(2.0,31.0)-1.0);

                    /* COMPUTE PROBABILITY OF FIRST OVERWRITE */
                    prbcnt = calcprob(foot,
                                      (psatom->para[cnum]).csize[sec],
                                      (psatom->para[cnum]).bsize[sec],
                                      (psatom->para[cnum]).assoc[sec],
                                      0);

                    /* ITERATE UNTIL ALL OVERWRITTEN OR PROBABILITY FAILS */
                    for (hit=0; ((hit < (psatom->para[cnum]).assoc[sec]) &&
                                 (prob > prbcnt)); hit++)
                    {
                        /* COMPUTE PROBABILITY OF NEXT OVERWRITE */
                        if (hit < ((psatom->para[cnum]).assoc[sec] - 1))
                            prbcnt += calcprob(foot,
                                               (psatom->para[cnum]).csize[sec],
                                               (psatom->para[cnum]).bsize[sec],
                                               (psatom->para[cnum]).assoc[sec],
                                               hit+1);

                        /* FIND LRU BLOCK TO EVICT */
                        leastused = 0;
                        for (x=0; x < (psatom->para[cnum]).assoc[sec]; x++)
                        {
                            /* UPDATE 'USE BITS' */
                            (psatom->data[cnum][sec][aline][x]).use++;
                            if ((psatom->data[cnum][sec][aline][x]).use >= leastused)
                            {
                                leastused = (psatom->data[cnum][sec][aline][x]).use;
                                leastx = x;
                            }
                        }

                        /* UPDATE CACHE DATA */
                        (psatom->data[cnum][sec][aline][leastx]).use = 0;
                        (psatom->data[cnum][sec][aline][leastx]).task =
                            (psatom->numtasks - 1);
                    }
                }
            }
        }

        /* RESET FOR NEXT INTERVAL */
        switchrec++;
        switchcnt = 0;
        switchnext = compint();
    }

    return;
}


/* DATA LOAD ROUTINE */
void readref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<psatom->numcaches; cnum++)
    {
        int type = (psatom->para[cnum]).type;
        int assoc = (psatom->para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).readcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[type]) >>
                (psatom->para[cnum]).lshift[type];
        atag = addr >> (psatom->para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][type][aline][x]).use)++;
            if (((psatom->data[cnum][type][aline][x]).tag == atag) &&
                ((psatom->data[cnum][type][aline][x]).task == proc))
            {
                (psatom->data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][type][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][type][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][type][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).readmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][type][aline][leastx]).task])++;

            /* UPDATE CACHE DATA */
            (psatom->data[cnum][type][aline][leastx]).tag = atag;
            (psatom->data[cnum][type][aline][leastx]).use = 0;
            (psatom->data[cnum][type][aline][leastx]).task = proc;
        }
    }

    return;
}

/* DATA STORE ROUTINE */
void writref(long addr, int proc)
{
    int index;
    int x, leastx;
    unsigned long leastused;
    long aline, atag;
    int cnum, hit;

    /* PROCESS REFERENCE IN EACH CACHE */
    for (cnum=0; cnum<psatom->numcaches; cnum++)
    {
        int type = (psatom->para[cnum]).type;
        int assoc = (psatom->para[cnum]).assoc[type];

        /* UPDATE STATISTICS */
        ((psatom->stat[cnum][proc]).writcnt)++;

        /* PARSE ADDRESS */
        aline = (addr & (psatom->para[cnum]).lmask[type]) >>
                (psatom->para[cnum]).lshift[type];
        atag = addr >> (psatom->para[cnum]).tshift[type];

        /* UPDATE 'USE BITS' AND CHECK FOR HIT */
        hit = 0;
        for (x=0; x<assoc; x++)
        {
            ((psatom->data[cnum][type][aline][x]).use)++;
            if (((psatom->data[cnum][type][aline][x]).tag == atag) &&
                ((psatom->data[cnum][type][aline][x]).task == proc))
            {
                (psatom->data[cnum][type][aline][x]).use = 0;
                hit = 1;
            }
        }

        /* IF NO HIT, FIND LRU BLOCK TO EVICT */
        if (hit == 0)
        {
            /* FIND LRU BLOCK */
            leastused = 0;
            for (x=0; x<assoc; x++)
            {
                if (((psatom->data[cnum][type][aline][x]).use >= leastused) ||
                    ((psatom->data[cnum][type][aline][x]).task ==
                     psatom->numtasks))
                {
                    leastused = (psatom->data[cnum][type][aline][x]).use;
                    leastx = x;
                }
                if ((psatom->data[cnum][type][aline][x]).task ==
                    psatom->numtasks)
                    x = assoc;
            }

            /* UPDATE STATISTICS */
            ((psatom->stat[cnum][proc]).writmisscnt)++;
            ((psatom->stat[cnum][proc]).interfere[
                (psatom->data[cnum][type][aline][leastx]).task])++;

            /* UPDATE */
            (psatom->data[cnum][type][aline][leastx]).tag = atag;
            (psatom->data[cnum][type][aline][leastx]).use = 0;
            (psatom->data[cnum][type][aline][leastx]).task = proc;
        }
    }

    return;
}


/* STORE RESULTS ROUTINE */
void printres(int proc)
{
    int c,x,y;
    stats total;
    FILE* file;

    file = fopen("cache.out","a");
    fprintf(file,"DATA AT END OF PROCESS %d\n",proc);
    fprintf(file,"<><><><><><><><><><><><><><><><><><><><><><><><><>\n");

    /* PRINT CACHE DATA */
    for (c=0; c<psatom->numcaches; c++)
    {
        fprintf(file,"simulation: %s (data at end of process %d)\n",
                psatom->name[0],proc);
        fprintf(file,"total context switches modeled: %lu\n",switchrec);
        fprintf(file," - \n");
        fprintf(file,"CACHE # %d\n", c);
        fprintf(file,"cache type: %d (0=unified, 1=split)\n",
                (psatom->para[c]).type);
        fprintf(file,"icache size: %d\n", (psatom->para[c]).csize[0]);
        fprintf(file,"icache line size: %d\n", (psatom->para[c]).bsize[0]);
        fprintf(file,"icache associativity: %d\n",
                (psatom->para[c]).assoc[0]);
        if ((psatom->para[c]).type == 1)
        {
            fprintf(file,"dcache size: %d\n", (psatom->para[c]).csize[1]);
            fprintf(file,"dcache line size: %d\n", (psatom->para[c]).bsize[1]);
            fprintf(file,"dcache associativity: %d\n",
                    (psatom->para[c]).assoc[1]);
        }

        total.instcnt = 0;
        total.readcnt = 0;
        total.writcnt = 0;
        total.instmisscnt = 0;
        total.readmisscnt = 0;
        total.writmisscnt = 0;

        /* PRINT PROCESS CACHE PERFORMANCE */
        for (y=0; y < psatom->numtasks; y++)
        {
            int z;

            total.instcnt = total.instcnt + (psatom->stat[c][y]).instcnt;
            total.readcnt = total.readcnt + (psatom->stat[c][y]).readcnt;
            total.writcnt = total.writcnt + (psatom->stat[c][y]).writcnt;
            total.instmisscnt = total.instmisscnt +
                                (psatom->stat[c][y]).instmisscnt;
            total.readmisscnt = total.readmisscnt +
                                (psatom->stat[c][y]).readmisscnt;
            total.writmisscnt = total.writmisscnt +
                                (psatom->stat[c][y]).writmisscnt;

            fprintf(file,"  **********\n");
            fprintf(file,"  Process #%d\n", y);
            fprintf(file,"  Inst  %12lu  ", (psatom->stat[c][y]).instcnt);
            fprintf(file,"Miss  %12lu  ", (psatom->stat[c][y]).instmisscnt);
            if ((psatom->stat[c][y]).instcnt != 0)
                fprintf(file,"Perc  %.6lf", 100.0 *
                        (psatom->stat[c][y]).instmisscnt /
                        (psatom->stat[c][y]).instcnt);
            fprintf(file,"\n  Data  %12lu  ",
                    (psatom->stat[c][y]).readcnt +
                    (psatom->stat[c][y]).writcnt);
            fprintf(file,"Miss  %12lu  ", (psatom->stat[c][y]).readmisscnt +
                    (psatom->stat[c][y]).writmisscnt);
            if (((psatom->stat[c][y]).readcnt +
                 (psatom->stat[c][y]).writcnt) != 0)
                fprintf(file,"Perc  %.6lf", 100.0 *
                        ((psatom->stat[c][y]).readmisscnt +
                         (psatom->stat[c][y]).writmisscnt) /
                        ((psatom->stat[c][y]).readcnt +
                         (psatom->stat[c][y]).writcnt));
            fprintf(file,"\n  read  %12lu  ",
                    (psatom->stat[c][y]).readcnt);
            fprintf(file,"Miss  %12lu  ", (psatom->stat[c][y]).readmisscnt);
            if ((psatom->stat[c][y]).readcnt != 0)
                fprintf(file,"Perc  %.6lf", 100.0 *
                        (psatom->stat[c][y]).readmisscnt /
                        (psatom->stat[c][y]).readcnt);
            fprintf(file,"\n  writ  %12lu  ",
                    (psatom->stat[c][y]).writcnt);
            fprintf(file,"Miss  %12lu  ", (psatom->stat[c][y]).writmisscnt);
            if ((psatom->stat[c][y]).writcnt != 0)
                fprintf(file,"Perc  %.6lf", 100.0 *
                        (psatom->stat[c][y]).writmisscnt /
                        (psatom->stat[c][y]).writcnt);
            fprintf(file,"\n  TOTAL  %12lu  ",
                    (psatom->stat[c][y]).instcnt +
                    (psatom->stat[c][y]).readcnt +
                    (psatom->stat[c][y]).writcnt);
            fprintf(file,"Miss  %12lu  ", (psatom->stat[c][y]).instmisscnt +
                    (psatom->stat[c][y]).readmisscnt +
                    (psatom->stat[c][y]).writmisscnt);
            if (((psatom->stat[c][y]).instcnt +
                 (psatom->stat[c][y]).readcnt +
                 (psatom->stat[c][y]).writcnt) != 0)
                fprintf(file,"Perc  %.6lf", 100.0 *
                        ((psatom->stat[c][y]).instmisscnt +
                         (psatom->stat[c][y]).readmisscnt +
                         (psatom->stat[c][y]).writmisscnt) /
                        ((psatom->stat[c][y]).instcnt +
                         (psatom->stat[c][y]).readcnt +
                         (psatom->stat[c][y]).writcnt));

            fprintf(file,"\n  Int (times process %d overwrote:)\n", y);
            for (z=0; z <= psatom->numtasks; z++)
                fprintf(file,"  Process %d = %12lu\n", z,
                        (psatom->stat[c][y]).interfere[z]);
            fprintf(file,"  (process %d is invalid data)\n",
                    psatom->numtasks);
        }

        /* PRINT TOTAL CACHE PERFORMANCE */
        fprintf(file,"  **********\n");
        fprintf(file,"  TOTAL FOR CACHE\n");
        fprintf(file,"  Inst  %12lu  ", total.instcnt);
        fprintf(file,"Miss  %12lu  ", total.instmisscnt);
        if (total.instcnt != 0)
            fprintf(file,"Perc  %.6lf", 100.0 * total.instmisscnt /
                    total.instcnt);
        fprintf(file,"\n  Data  %12lu  ", total.readcnt +
                total.writcnt);
        fprintf(file,"Miss  %12lu  ", total.readmisscnt + total.writmisscnt);
        if ((total.readcnt + total.writcnt) != 0)
            fprintf(file,"Perc  %.6lf", 100.0 *
                    (total.readmisscnt + total.writmisscnt) /
                    (total.readcnt + total.writcnt));
        fprintf(file,"\n  read  %12lu  ", total.readcnt);
        fprintf(file,"Miss  %12lu  ", total.readmisscnt);
        if (total.readcnt != 0)
            fprintf(file,"Perc  %.6lf", 100.0 * total.readmisscnt /
                    total.readcnt);
        fprintf(file,"\n  writ  %12lu  ", total.writcnt);
        fprintf(file,"Miss  %12lu  ", total.writmisscnt);
        if (total.writcnt != 0)
            fprintf(file,"Perc  %.6lf", 100.0 * total.writmisscnt /
                    total.writcnt);
        fprintf(file,"\n  TOTAL  %12lu  ", total.instcnt +
                total.readcnt +
                total.writcnt);
        fprintf(file,"Miss  %12lu  ", total.instmisscnt +
                total.readmisscnt +
                total.writmisscnt);
        if ((total.instcnt + total.readcnt + total.writcnt) != 0)
            fprintf(file,"Perc  %.6lf", 100.0 * (total.instmisscnt +
                                                total.readmisscnt +
                                                total.writmisscnt) /
                    (total.instcnt +
                     total.readcnt +
                     total.writcnt));
        fprintf(file,"\n");
        fprintf(file,"\f");
    }

    fclose(file);

    /* IF LAST PROCESS, SHUT DOWN SIMULATION */
    psatom->count--;
    if (psatom->count == 0)
    {
        psatom->numcaches = 0;
        psatom->numtasks = 0;
    }

    return;
}
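
The reference routines in the listings above share one replacement scheme: every block in the
selected set is aged on each access, a hit requires both the tag and the owning task to match, and
on a miss an invalid block is preferred over the oldest one. The sketch below isolates that idea
for a single set. It is a simplified illustration rather than the thesis code: the names block_t,
access_set, WAYS and INVALID_TASK are hypothetical, and invalid blocks are marked with -1 here,
whereas the listings mark them with task == numtasks.

```c
#include <assert.h>

#define WAYS 2             /* hypothetical associativity for this sketch */
#define INVALID_TASK (-1)  /* hypothetical marker for an empty block */

typedef struct {
    long tag;              /* address tag stored in the block */
    unsigned long use;     /* age counter: 0 = most recently used */
    int task;              /* owning process, or INVALID_TASK */
} block_t;

/* Access one set. Returns 1 on a hit and 0 on a miss; on a miss the
 * victim block is overwritten with the new tag and task. */
int access_set(block_t set[WAYS], long tag, int task)
{
    int x, hit = 0, victim = 0;
    unsigned long oldest = 0;

    /* Age every block, then reset the age of a matching block. */
    for (x = 0; x < WAYS; x++) {
        set[x].use++;
        if (set[x].task == task && set[x].tag == tag) {
            set[x].use = 0;
            hit = 1;
        }
    }
    if (hit)
        return 1;

    /* Miss: take the first invalid block if any, else the oldest block. */
    for (x = 0; x < WAYS; x++) {
        if (set[x].task == INVALID_TASK) {
            victim = x;
            break;
        }
        if (set[x].use >= oldest) {
            oldest = set[x].use;
            victim = x;
        }
    }
    set[victim].tag = tag;
    set[victim].use = 0;
    set[victim].task = task;
    return 0;
}
```

Because the task is part of the match, a second process touching the same tag misses and evicts
a block, which is the event the interference counters in the listings record.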
B  Tables of Simulation Results

Key  to  data  tables: 

Miss  Data 

•  Inst  =  instruction  fetch  misses 

•  Read  =  data  read  misses 

•  Write  =  data  write  misses 

•  Data  =  total  data  read  and  write  misses 

•  Total  =  total  misses 

•  %  =  miss  rate 

Interference  Data  (Int(95^)) 

•  Process  0  is  the  kernel,  except  for  simulations  with  the  context  switch  model  where  process  0 
is  the  test  program. 

•  Additional processes' numbers are shown in the same order as the tables.

•  The  extra  process  is  for  cases  where  invalid  data  is  overwritten  (at  simulation  start). 
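
For reference, the miss rate in the key is the miss count expressed as a percentage of the
corresponding reference count, and the Data figures are the sum of the Read and Write figures.
A minimal sketch of that calculation follows. The function name miss_rate_percent is
hypothetical, and returning 0.0 for an empty category is an assumption of this sketch; the
printres routines simply omit the percentage when the reference count is zero.

```c
#include <assert.h>

/* Hypothetical helper: miss rate in percent. Returns 0.0 when there were
 * no references (an assumption of this sketch, see the text above). */
double miss_rate_percent(unsigned long misses, unsigned long refs)
{
    if (refs == 0)
        return 0.0;
    return 100.0 * (double)misses / (double)refs;
}
```

For example, the combined data miss rate would be computed as
miss_rate_percent(readmiss + writmiss, reads + writes), where the four counts are
whatever per-process or per-cache totals are being summarized.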

B.1  Compress Alone

Compress data: Table 6

B.2  GCC Alone

GCC data: Table 7

B.3  Espresso Alone

Espresso data: Table 8

B.4  Alvinn Alone

Alvinn data: Table 9

B.5  Compress w/ Operating System

Compress data: Table 10
Operating System data: Table 11
Combined data: Table 12

B.6  GCC w/ Operating System

GCC data: Table 13
Operating System data: Table 14
Combined data: Table 15

B.7  Espresso w/ Operating System

Espresso data: Table 16
Operating System data: Table 17
Combined data: Table 18

B.8  Alvinn w/ Operating System

Alvinn data: Table 19
Operating System data: Table 20
Combined data: Table 21

B.9  Compress and GCC w/ Operating System

Compress data: Table 22
GCC data: Table 23
Operating System data: Table 24
Combined data: Table 25

B.10  Compress and Espresso w/ Operating System

Compress data: Table 26
Espresso data: Table 27
Operating System data: Table 28
Combined data: Table 29

B.11  GCC and Espresso w/ Operating System

GCC data: Table 30
Espresso data: Table 31
Operating System data: Table 32
Combined data: Table 33

B.12  Compress w/ Model, n=1

Compress data: Table 34

B.13  GCC w/ Model, n=1

GCC data: Table 35

B.14  Espresso w/ Model, n=1

Espresso data: Table 36

B.15  Alvinn w/ Model, n=1

Alvinn data: Table 37

B.16  Compress w/ Model, n=2

Compress data: Table 38

B.17  GCC w/ Model, n=2

GCC data: Table 39

B.18  Espresso w/ Model, n=2

Espresso data: Table 40
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Table  6:  Compress  Alone 


157  0.0002  3276713  14.6203  18121  0.2126  3294634  10.6513  3294991  2.7928  3294578
218  0.0003  3642992  16.2546  80614  0.9460  3723606  12.0374  3723824  3.1563  3723606
 96  0.0001  3431770  15.3122  23850  0.2799  3455620  11.1711  3455716  2.9291  3455496
 96  0.0001  3376695  15.0664  13679  0.1605  3390374  10.9601  3390470  2.8738  3390247


Table 7: GCC Alone
Table 9: Alvinn Alone
Table 10: Compress w/ Operating System, Compress Data
Table 11: Compress w/ Operating System, Operating System Data
Table 12: Compress w/ Operating System, Combined Data
Table 13: GCC w/ Operating System, GCC Data




Table 14: GCC w/ Operating System, Operating System Data
Table 16: Espresso w/ Operating System, Espresso Data
Table 18: Espresso w/ Operating System, Combined Data
Table 19: Alvinn w/ Operating System, Alvinn Data
Table 22: Compress and GCC w/ Operating System, Compress Data
Table 24: Compress and GCC w/ Operating System, Operating System Data
Table 25: Compress and GCC w/ Operating System, Combined Data
Table 26: Compress and Espresso w/ Operating System, Compress Data
Table 27: Compress and Espresso w/ Operating System, Espresso Data
Table 28: Compress and Espresso w/ Operating System, Operating System Data
Table 29: Compress and Espresso w/ Operating System, Combined Data
Table 31: GCC and Espresso w/ Operating System, Espresso Data

354767  0.1584  1021966  1.9987  203020  1.6781  1224986  1.9374  1579753  0.5500  443086  681871  454796
251429  0.1122   632683  1.2378  138431  1.1443   771314  1.2199  1022743  0.3561  321370  456978  244395
435774  0.1945  2205351  4.3131  254476  2.1035  2459827  3.8903  2895601  1.0081  422012  ^4590  1^8999
329531  0.1471  1169631  2.3266  185348  1.5321  1374979  2.1746  1704510  0.5934  451474  712621  540415
252814  0.1129   736403  1.4402  121971  1.0062   858374  1.3576  1111188  0.3868  370143  503201  237844


Table 32: GCC and Espresso w/ Operating System, Operating System Data
Table 33: GCC and Espresso w/ Operating System, Combined Data




5445158  0.9518


Table 34: Compress w/ Model, n=1


1628  0.0000  3329522  0.1486  22173  0.0026  3351695  0.1084  3353323  0.0284  2998066  355022  235
2145  0.0000  3677102  0.1641  82525  0.0097  3759627  0.1215  3761772  0.0319  3573411  188236  125
1297  0.0000  3464412  0.1546  31646  0.0037  3496058  0.1130  3497355  0.0296  3301958  195253  144
1066  0.0000  3408615  0.1521  17548  0.0021  3426163  0.1108  3427229  0.0290  3227555  199525  149


Table 35: GCC w/ Model, n=1

591708  0.6542  1137861  0.4956  832022  305608


Table 36: Espresso w/ Model, n=1
Table 40: Espresso w/ Model, n=2